Getting started

date()
## [1] "Wed Nov 30 13:07:45 2022"

IODS2022 is an introductory course to open data science and the open tools commonly used in it.

Link to my GitHub repository: https://github.com/jmleppal/IODS-project

I went through R for Health Data Science and got some grasp of the elementary use of R. All in all, the book is an excellent introduction to the topic. I'm still trying to understand the inner workings of GitHub, but at the moment it looks like a long path of trial and error. When it comes to R Markdown, working in the visual mode seems easier now in the beginning.


Regression and model validation

knitr::opts_chunk$set(message = FALSE)
date()
## [1] "Wed Nov 30 13:07:45 2022"
# READING IN THE ORIGINAL DATASET

# Read the data into memory
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)

# Look at the dimensions and structure of the data
dim(learning2014)
## [1] 183  60
# str(learning2014)

The dataset used in this exercise is based on an international survey of students' approaches to learning in 2013-2015. The measures in parts A, B and C come from the ASSIST (Approaches and Study Skills Inventory for Students) questionnaire, and part D is based on the SATS (Survey of Attitudes Toward Statistics) questionnaire. Some background and other variables have been omitted, but all 183 responses are included. All in all, the dataset contains 60 variables and 183 observations. All variables except gender are integers. No values are missing.

# CREATING THE ANALYSIS DATASET

# Access the dplyr library
library(dplyr)

# Create an analysis dataset
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30", "D06", "D15", "D23", "D31")
surface_questions <- c("SU02", "SU10", "SU18", "SU26", "SU05", "SU13", "SU21", "SU29", "SU08", "SU16", "SU24", "SU32")
strategic_questions <- c("ST01", "ST09", "ST17", "ST25", "ST04", "ST12", "ST20", "ST28")

learn2014an <- learning2014 %>% select(gender, Age, Points)
learn2014an$deep <- rowMeans(learning2014[, deep_questions])
learn2014an$surf <- rowMeans(learning2014[, surface_questions])
learn2014an$stra <- rowMeans(learning2014[, strategic_questions]) 

# create the column 'attitude' by scaling the column "Attitude"
learn2014an$attitude <- learning2014$Attitude / 10

# print out the column names of the data
colnames(learn2014an)
## [1] "gender"   "Age"      "Points"   "deep"     "surf"     "stra"     "attitude"
# change the names of certain columns
colnames(learn2014an)[2] <- "age"
colnames(learn2014an)[3] <- "points"

# select rows where points is greater than zero
learn2014an <- filter(learn2014an, points > 0)
 
# Look at the dimensions and structure of the analysis data
# dim(learn2014an)
str(learn2014an)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...

In the analysis dataset, the variables gender, age and points are unmodified, attitude is the original Attitude score divided by 10, and the new variables deep, surf and stra are formed by averaging the respective part B items. Attitude measures the global attitude towards statistics. Deep measures how inclined the participant is towards deep learning, surf measures how superficially the participant approaches studying, and stra measures how organized the participant's study habits are. Points are the points obtained in the exam of the course Introduction to Social Statistics.

The value 0 in points means that the participant did not take the exam (n = 17, 9%). These participants are excluded from the analysis dataset, which thus consists of 7 variables and 166 observations.

# SAVING THE ANALYSIS DATASET AND CHECKING THAT READING ANEW WORKS

# Set the working directory for saving and data
setwd("C:/Users/yona2/Desktop/FT/IODS2022/RStudio/IODS-project/data")

# Access the tidyverse library
library(tidyverse)

# Save the analysis dataset
write.csv(learn2014an,"learn2014an.csv", row.names = FALSE)

# Checking that reading of the saved file works
# learn2014an2 <- read.table("learn2014an.csv", sep=",", header=TRUE)
# str(learn2014an2)
# head(learn2014an2)

Saving and reading the saved file work as expected.
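The commented-out check above can also be made concrete. A minimal, self-contained sketch of the same round trip, using a small toy data frame and a temporary file instead of the actual learn2014an object:

```r
# Round-trip check: write a small data frame to CSV and read it back,
# then verify that dimensions and values survive unchanged
df <- data.frame(gender = c("F", "M"), age = c(21L, 25L), points = c(30L, 28L))
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
df2 <- read.table(path, sep = ",", header = TRUE)
stopifnot(identical(dim(df), dim(df2)), all(df == df2))
```

The same pattern (write, re-read, compare structures) applies to the real learn2014an.csv.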

# EXPLORING THE ANALYSIS DATA

# Access the GGally, ggplot2 and cowplot libraries
library(GGally)
library(ggplot2)

# Get summary values
learn2014an %>% group_by(gender) %>% summarise(count= n())
## # A tibble: 2 × 2
##   gender count
##   <chr>  <int>
## 1 F        110
## 2 M         56
summarise_at(learn2014an, vars(attitude:points), tibble::lst(mean, sd))
##   attitude_mean stra_mean surf_mean deep_mean points_mean attitude_sd   stra_sd
## 1      3.142771  3.121235  2.787149  3.679719    22.71687   0.7299079 0.7718318
##     surf_sd   deep_sd points_sd
## 1 0.5288405 0.5541369  5.894884
learn2014an %>% group_by(gender) %>% summarise_at(vars(age), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
##   gender  mean    sd   min median   max
##   <chr>  <dbl> <dbl> <int>  <dbl> <int>
## 1 F       24.9  7.36    17     22    53
## 2 M       26.8  8.43    19     24    55
learn2014an %>% group_by(gender) %>% summarise_at(vars(attitude), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
##   gender  mean    sd   min median   max
##   <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 F       2.99 0.725   1.4   2.95   5  
## 2 M       3.44 0.646   1.7   3.4    4.8
learn2014an %>% group_by(gender) %>% summarise_at(vars(deep), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
##   gender  mean    sd   min median   max
##   <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 F       3.66 0.535  1.58   3.67  4.75
## 2 M       3.72 0.592  2.08   3.79  4.92
learn2014an %>% group_by(gender) %>% summarise_at(vars(surf), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
##   gender  mean    sd   min median   max
##   <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 F       2.83 0.458  1.83   2.83  4   
## 2 M       2.70 0.642  1.58   2.62  4.33
learn2014an %>% group_by(gender) %>% summarise_at(vars(stra), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
##   gender  mean    sd   min median   max
##   <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 F       3.20 0.748  1.38   3.25   5  
## 2 M       2.96 0.801  1.25   3      4.5
learn2014an %>% group_by(gender) %>% summarise_at(vars(points), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
##   gender  mean    sd   min median   max
##   <chr>  <dbl> <dbl> <int>  <dbl> <int>
## 1 F       22.3  5.83     7   23      33
## 2 M       23.5  6.01     9   23.5    33
# Create and draw a more advanced plot matrix with ggpairs()
p <- ggpairs(learn2014an[, c(1, 2, 7, 4, 5, 6, 3)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p

# Initialize plot with data and aesthetic mapping
p1 <- ggplot(learn2014an, aes(x = attitude, y = points))

# Define the visualization type (points) + add a regression line
# + add a main title and draw the plot
p2 <- p1 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's attitude versus exam points")
p2

Out of the 166 participants, 56 (34%) are men and 110 (66%) women. Age ranges from 17 to 55 years, the mean age (sd) being 26.8 (8.4) years for men and 24.9 (7.4) years for women. The mean scores (sd) are 3.1 (0.7) for attitude, 3.7 (0.6) for deep, 2.8 (0.5) for surf, 3.1 (0.8) for stra and 22.7 (5.9) for points. The distribution of age is skewed to the right; otherwise the distributions are reasonably symmetrical and do not differ noticeably between genders.

Attitude is significantly correlated with points (rho = 0.44), an association clearly visible in the respective plot. On the other hand, surf is slightly negatively (rho = -0.14) and stra slightly positively (rho = 0.15) correlated with points, as expected. Surf is also significantly negatively correlated with attitude (rho = -0.18), deep (rho = -0.32) and stra (rho = -0.16), as expected. Otherwise the associations are unremarkable.
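The correlations quoted above are read off the ggpairs matrix, but they can also be computed directly with cor(). A minimal sketch on toy data; for the real data, `d` would be the numeric columns of learn2014an:

```r
# Pairwise correlation matrix; method = "spearman" gives the rank
# correlation (rho). Toy data stand in for learn2014an here.
set.seed(1)
d <- data.frame(attitude = runif(50, 1, 5),
                surf     = runif(50, 1, 5),
                points   = sample(7:33, 50, replace = TRUE))
round(cor(d, method = "spearman"), 2)
```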

# BUILDING A MULTIPLE REGRESSION MODEL

# Fit a linear model
my_model1 <- lm(points ~ age, data = learn2014an)
my_model2 <- lm(points ~ gender, data = learn2014an)
my_model3 <- lm(points ~ attitude, data = learn2014an)
my_model4 <- lm(points ~ deep, data = learn2014an)
my_model5 <- lm(points ~ surf, data = learn2014an)
my_model6 <- lm(points ~ stra, data = learn2014an)

my_model31 <- lm(points ~ attitude + stra + surf, data = learn2014an)
my_model32 <- lm(points ~ attitude + age + gender, data = learn2014an)

# Print out a summary of the models
summary(my_model1)
## 
## Call:
## lm(formula = points ~ age, data = learn2014an)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.0360  -3.7531   0.0958   4.6762  10.8128 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.52150    1.57339  15.585   <2e-16 ***
## age         -0.07074    0.05901  -1.199    0.232    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared:  0.008684,   Adjusted R-squared:  0.00264 
## F-statistic: 1.437 on 1 and 164 DF,  p-value: 0.2324
summary(my_model2)
## 
## Call:
## lm(formula = points ~ gender, data = learn2014an)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3273  -3.3273   0.5179   4.5179  10.6727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.3273     0.5613  39.776   <2e-16 ***
## genderM       1.1549     0.9664   1.195    0.234    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared:  0.008632,   Adjusted R-squared:  0.002587 
## F-statistic: 1.428 on 1 and 164 DF,  p-value: 0.2338
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude, data = learn2014an)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
summary(my_model4)
## 
## Call:
## lm(formula = points ~ deep, data = learn2014an)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6913  -3.6935   0.2862   4.9957  10.3537 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.1141     3.0908   7.478 4.31e-12 ***
## deep         -0.1080     0.8306  -0.130    0.897    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.913 on 164 degrees of freedom
## Multiple R-squared:  0.000103,   Adjusted R-squared:  -0.005994 
## F-statistic: 0.01689 on 1 and 164 DF,  p-value: 0.8967
summary(my_model5)
## 
## Call:
## lm(formula = points ~ surf, data = learn2014an)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6539  -3.3744   0.3574   4.4734  10.2234 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.2017     2.4432  11.134   <2e-16 ***
## surf         -1.6091     0.8613  -1.868   0.0635 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.851 on 164 degrees of freedom
## Multiple R-squared:  0.02084,    Adjusted R-squared:  0.01487 
## F-statistic:  3.49 on 1 and 164 DF,  p-value: 0.06351
summary(my_model6)
## 
## Call:
## lm(formula = points ~ stra, data = learn2014an)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.5581  -3.8198   0.1042   4.3024  10.1394 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   19.233      1.897  10.141   <2e-16 ***
## stra           1.116      0.590   1.892   0.0603 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.849 on 164 degrees of freedom
## Multiple R-squared:  0.02135,    Adjusted R-squared:  0.01538 
## F-statistic: 3.578 on 1 and 164 DF,  p-value: 0.06031
summary(my_model31)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learn2014an)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
summary(my_model32)
## 
## Call:
## lm(formula = points ~ attitude + age + gender, data = learn2014an)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.42910    2.29043   5.863 2.48e-08 ***
## attitude     3.60657    0.59322   6.080 8.34e-09 ***
## age         -0.07586    0.05367  -1.414    0.159    
## genderM     -0.33054    0.91934  -0.360    0.720    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude, data = learn2014an)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
# Draw diagnostic plots using the plot() function, choosing the plots 1, 2 and 5
par(mfrow = c(1,3))
plot(my_model3, which = c(1, 2, 5))

In the univariate regression models, only attitude has a statistically significant association with points (p = 4.12e-09). Surf and stra are only marginally associated with points (p = 0.06 for both).

Attitude, surf and stra are selected as explanatory variables for a multiple regression model. In that model, attitude retains its statistical significance, while surf and stra add little. The overall fit of the model does not improve: there is practically no change in multiple R-squared (0.21 in the multiple regression versus 0.19 in the univariate model) and the overall p-value actually worsens (3.16e-08 versus 4.12e-09). Likewise, adding the demographic factors age and gender to the model does not improve it.

Therefore, only attitude is selected as an explanatory variable for the final model. An increase of one unit in attitude is associated with an increase of 3.5 points in the exam results. Based on the multiple R-squared, attitude explains 19% of the variation in points. Even though this explanatory power is rather small, the model is statistically highly significant (F = 38.6 on 1 and 164 df, p = 4.12e-09). This kind of result is rather typical for human behavioral data.
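The point estimate of 3.5 could be complemented with an interval estimate via confint(); on the real data the call would be confint(my_model3). A self-contained sketch on simulated data with a similar structure:

```r
# 95% confidence interval for a regression slope with confint();
# the toy data are simulated to resemble the attitude-points setting
set.seed(2)
toy <- data.frame(attitude = runif(166, 1, 5))
toy$points <- 11.6 + 3.5 * toy$attitude + rnorm(166, sd = 5.3)
fit <- lm(points ~ attitude, data = toy)
round(confint(fit, "attitude", level = 0.95), 2)
```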

In a linear regression model, the association between the dependent and explanatory variables is assumed to be linear and monotonically increasing or decreasing. The residuals should be normally distributed with constant variance, and the explanatory variables should not be strongly correlated with each other. The diagnostic plots did not show any noteworthy deviations from these assumptions; there were a couple of outliers, but they did not seem to distort the association between attitude and points. The distributions of both attitude and points were reasonably symmetrical and close enough to normal. In the scatter plot of points against attitude, the association appeared clearly monotonically increasing.


Logistic regression

knitr::opts_chunk$set(message = FALSE)
date()
## [1] "Wed Nov 30 13:08:00 2022"
# READING IN THE DATASET

# set the working directory
setwd("C:/Users/yona2/Desktop/FT/IODS2022/RStudio/IODS-project/data")

# access the tidyverse library
library(tidyverse)

# read the data
alc <- read.table("alc.csv", sep = ",", header = TRUE)

# check structure and dimensions
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "guardian"   "traveltime" "studytime" 
## [16] "schoolsup"  "famsup"     "activities" "nursery"    "higher"    
## [21] "internet"   "romantic"   "famrel"     "freetime"   "goout"     
## [26] "Dalc"       "Walc"       "health"     "failures"   "paid"      
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"
# glimpse(alc)

The data is based on two surveys of student achievement in secondary education at two Portuguese schools. The dataset contains factors related to performance in mathematics and in Portuguese. The variables failures, paid, absences, G1, G2 and G3 represent a combined assessment of both subjects. The analyses focus on the relationship between alcohol consumption (alc_use; 1 = very low to 5 = very high) and the other variables. Alcohol use is considered high if alc_use exceeds 2 (high_use).

All in all, there are 36 variables and 370 observations included. No values are missing.

As for primary hypotheses, I would expect going out, failures and absences to be positively and final grade (G3) to be negatively associated with alcohol use.

# EXPLORING THE DATA

# access the GGally and ggplot2 libraries
library(GGally); library(ggplot2)

# set summary values
summary(alc[, c(4, 8:9, 14:15, 23:25, 28:29, 31, 34:35)])
##       age             Medu          Fedu         traveltime      studytime    
##  Min.   :15.00   Min.   :0.0   Min.   :0.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:16.00   1st Qu.:2.0   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :17.00   Median :3.0   Median :3.000   Median :1.000   Median :2.000  
##  Mean   :16.58   Mean   :2.8   Mean   :2.557   Mean   :1.446   Mean   :2.043  
##  3rd Qu.:17.00   3rd Qu.:4.0   3rd Qu.:3.750   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :22.00   Max.   :4.0   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##      famrel         freetime         goout           health     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:3.000  
##  Median :4.000   Median :3.000   Median :3.000   Median :4.000  
##  Mean   :3.935   Mean   :3.224   Mean   :3.116   Mean   :3.562  
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##     failures         absences            G3           alc_use     
##  Min.   :0.0000   Min.   : 0.000   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.: 1.000   1st Qu.:10.00   1st Qu.:1.000  
##  Median :0.0000   Median : 3.000   Median :12.00   Median :1.500  
##  Mean   :0.1892   Mean   : 4.511   Mean   :11.52   Mean   :1.889  
##  3rd Qu.:0.0000   3rd Qu.: 6.000   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :3.0000   Max.   :45.000   Max.   :18.00   Max.   :5.000
alc %>% group_by(sex) %>% summarise(count= n())
## # A tibble: 2 × 2
##   sex   count
##   <chr> <int>
## 1 F       195
## 2 M       175
alc %>% group_by(high_use) %>% summarise(count= n())
## # A tibble: 2 × 2
##   high_use count
##   <lgl>    <int>
## 1 FALSE      259
## 2 TRUE       111
alc %>% group_by(sex, high_use) %>% summarise(count= n())
## # A tibble: 4 × 3
## # Groups:   sex [2]
##   sex   high_use count
##   <chr> <lgl>    <int>
## 1 F     FALSE      154
## 2 F     TRUE        41
## 3 M     FALSE      105
## 4 M     TRUE        70
p1 <- ggpairs(alc[, c(35, 4, 8:9, 14:15, 23:25)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p2 <- ggpairs(alc[, c(35, 28:29, 31, 34)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p3 <- ggpairs(alc[, c(36, 4, 8:9, 14:15, 23:25)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p4 <- ggpairs(alc[, c(36, 28:29, 31, 34:35)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p1; p2; p3; p4

# initialize plot with data and aesthetic mapping
p5 <- ggplot(alc, aes(x = goout, y = alc_use))
p6 <- ggplot(alc, aes(x = failures, y = alc_use))
p7 <- ggplot(alc, aes(x = absences, y = alc_use))
p8 <- ggplot(alc, aes(x = G3, y = alc_use))

# define the visualization type (points) + add a regression line
# + add a main title and draw the plot
p9 <- p5 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's going out versus alcohol use")
p10 <- p6 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's failures versus alcohol use")
p11 <- p7 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's absences versus alcohol use")
p12 <- p8 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's G3 versus alcohol use")
p9; p10; p11; p12

p13 <- ggpairs(alc[, c(3:4, 25, 29, 31, 34)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p13

Out of the 370 participants, 175 (47%) are men and 195 (53%) women. The median age is 17 years, ranging from 15 to 22. 111 (30%) of the participants use alcohol in high quantities; 70 (40%) of the men and 41 (21%) of the women are high users. The distributions (median, range) of going out (3, 1-5) and final grade (12, 0-18) are reasonably symmetrical, but failures (0, 0-3) and absences (3, 0-45) are strongly skewed to the right.

As expected, in the initial exploration going out (rho = 0.40), failures (rho = 0.21) and absences (rho = 0.21) are positively, and final grade (rho = -0.16) negatively, associated with alcohol use. When alcohol use is dichotomized, these relationships seem to attenuate. On the other hand, going out (rho = -0.15), failures (rho = -0.36) and absences (rho = -0.10) are negatively associated with final grade. Furthermore, going out is positively associated with failures (rho = 0.14) and absences (rho = 0.11).
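The shares of high users by sex quoted above come from a simple cross-tabulation. A self-contained sketch that reconstructs them from the group counts reported by the summaries above; on the real data, `d` would be replaced by alc:

```r
# Proportion of high users within each sex; counts taken from the
# group_by(sex, high_use) summary above
d <- data.frame(sex      = rep(c("F", "M"), c(195, 175)),
                high_use = rep(c(FALSE, TRUE, FALSE, TRUE), c(154, 41, 105, 70)))
round(prop.table(table(d$sex, d$high_use), margin = 1), 2)
# each row sums to 1; the high-use share is 0.21 for F and 0.40 for M
```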

# BUILDING A LOGISTIC REGRESSION MODEL

# fit the models with glm()
m1 <- glm(high_use ~ age, data = alc, family = "binomial")
m2 <- glm(high_use ~ sex, data = alc, family = "binomial")
m3 <- glm(high_use ~ goout, data = alc, family = "binomial")
m4 <- glm(high_use ~ failures, data = alc, family = "binomial")
m5 <- glm(high_use ~ absences, data = alc, family = "binomial")
m6 <- glm(high_use ~ G3, data = alc, family = "binomial")

m7 <- glm(high_use ~ age + sex, data = alc, family = "binomial")
m8 <- glm(high_use ~ goout + failures + absences + G3, data = alc, family = "binomial")
m9 <- glm(high_use ~ age + sex + goout + failures + absences, data = alc, family = "binomial")
m10 <- glm(high_use ~ sex + goout + failures + absences, data = alc, family = "binomial")

# print out a summary of the model
summary(m1); summary(m2); summary(m3); summary(m4); summary(m5); 
## 
## Call:
## glm(formula = high_use ~ age, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0968  -0.8705  -0.8020   1.4319   1.6943  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -4.07538    1.60980  -2.532   0.0114 *
## age          0.19414    0.09628   2.016   0.0438 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 447.94  on 368  degrees of freedom
## AIC: 451.94
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = high_use ~ sex, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0108  -1.0108  -0.6871   1.3537   1.7660  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.3234     0.1757  -7.530 5.06e-14 ***
## sexM          0.9179     0.2339   3.925 8.67e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 436.13  on 368  degrees of freedom
## AIC: 440.13
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = high_use ~ goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3712  -0.7687  -0.5470   0.9952   2.3037  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.3368     0.4188  -7.968 1.62e-15 ***
## goout         0.7563     0.1165   6.494 8.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 403.05  on 368  degrees of freedom
## AIC: 407.05
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = high_use ~ failures, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6367  -0.7943  -0.7943   1.3143   1.6171  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.9920     0.1234  -8.040 8.99e-16 ***
## failures      0.6759     0.1973   3.425 0.000615 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 439.75  on 368  degrees of freedom
## AIC: 443.75
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3598  -0.8209  -0.7318   1.2658   1.7419  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.26945    0.16094  -7.888 3.08e-15 ***
## absences     0.08867    0.02317   3.827  0.00013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 434.52  on 368  degrees of freedom
## AIC: 438.52
## 
## Number of Fisher Scoring iterations: 4
summary(m6); summary(m7); summary(m8); summary(m9); summary(m10); 
## 
## Call:
## glm(formula = high_use ~ G3, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2365  -0.8551  -0.7651   1.4213   1.7731  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.13783    0.40374   0.341   0.7328  
## G3          -0.08689    0.03466  -2.507   0.0122 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 445.68  on 368  degrees of freedom
## AIC: 449.68
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = high_use ~ age + sex, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3034  -0.8870  -0.7086   1.2275   1.9163  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.72614    1.65095  -2.863   0.0042 ** 
## age          0.20425    0.09812   2.082   0.0374 *  
## sexM         0.93259    0.23552   3.960 7.51e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 431.73  on 367  degrees of freedom
## AIC: 437.73
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = high_use ~ goout + failures + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0077  -0.7367  -0.5427   0.9191   2.3664  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.42137    0.69467  -4.925 8.43e-07 ***
## goout        0.69791    0.11881   5.874 4.25e-09 ***
## failures     0.52078    0.23720   2.195 0.028128 *  
## absences     0.07373    0.02233   3.303 0.000958 ***
## G3          -0.01612    0.04171  -0.386 0.699253    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 383.91  on 365  degrees of freedom
## AIC: 393.91
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = high_use ~ age + sex + goout + failures + absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1461  -0.7853  -0.5274   0.7258   2.3609  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.72724    1.89765  -2.491 0.012735 *  
## age          0.03623    0.11375   0.318 0.750115    
## sexM         0.98157    0.26166   3.751 0.000176 ***
## goout        0.69397    0.12157   5.708 1.14e-08 ***
## failures     0.47932    0.23331   2.054 0.039930 *  
## absences     0.08171    0.02276   3.590 0.000331 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 369.40  on 364  degrees of freedom
## AIC: 381.4
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = high_use ~ sex + goout + failures + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1521  -0.7929  -0.5317   0.7412   2.3706  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.14389    0.47881  -8.654  < 2e-16 ***
## sexM         0.97989    0.26156   3.746 0.000179 ***
## goout        0.69801    0.12093   5.772 7.83e-09 ***
## failures     0.48932    0.23073   2.121 0.033938 *  
## absences     0.08246    0.02266   3.639 0.000274 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 369.50  on 365  degrees of freedom
## AIC: 379.5
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the models
coef(m1); coef(m2); coef(m3); coef(m4); coef(m5)
## (Intercept)         age 
##  -4.0753783   0.1941355
## (Intercept)        sexM 
##  -1.3233805   0.9179154
## (Intercept)       goout 
##  -3.3368302   0.7563474
## (Intercept)    failures 
##  -0.9920231   0.6758788
## (Intercept)    absences 
## -1.26944532  0.08866504
coef(m6); coef(m7); coef(m8); coef(m9); coef(m10)
## (Intercept)          G3 
##   0.1378294  -0.0868889
## (Intercept)         age        sexM 
##  -4.7261439   0.2042451   0.9325859
## (Intercept)       goout    failures    absences          G3 
## -3.42137467  0.69790726  0.52077970  0.07373480 -0.01611567
## (Intercept)         age        sexM       goout    failures    absences 
## -4.72723868  0.03622756  0.98157242  0.69396783  0.47932468  0.08171370
## (Intercept)        sexM       goout    failures    absences 
## -4.14389330  0.97988506  0.69801208  0.48932427  0.08245946
# comparing nested models with likelihood ratio tests
m82 <- glm(high_use ~ goout + failures + absences, data = alc, family = "binomial")
anova(m8, m82, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ goout + failures + absences + G3
## Model 2: high_use ~ goout + failures + absences
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       365     383.91                     
## 2       366     384.06 -1 -0.14891   0.6996
anova(m9, m10, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ age + sex + goout + failures + absences
## Model 2: high_use ~ sex + goout + failures + absences
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       364      369.4                     
## 2       365      369.5 -1 -0.10135   0.7502
# compute odds ratios (OR) and confidence intervals (CI) for the final model
OR10 <- coef(m10) %>% exp %>% round(digits = 2)
CI10 <- confint(m10) %>% exp %>% round(digits = 2)

# print out the odds ratios with their confidence intervals
cbind(OR10, CI10)
##             OR10 2.5 % 97.5 %
## (Intercept) 0.02  0.01   0.04
## sexM        2.66  1.61   4.49
## goout       2.01  1.59   2.56
## failures    1.63  1.04   2.59
## absences    1.09  1.04   1.14

In univariate logistic regression models, all selected explanatory variables, as well as age and sex as demographic factors, are statistically significantly associated with high alcohol use. In the multiple regression model, however, final grade and age are no longer significantly associated with high alcohol consumption, and the non-significant likelihood ratio tests also favor leaving them out of the final model. The point estimates of the explanatory variables included in the final model are nearly the same as in the respective univariate models.

In the final model, a one-unit increase in going out increased the odds of high alcohol use by 101% (95% CI 59%-156%), a one-unit increase in failures by 63% (4%-159%), and one additional absence by 9% (4%-14%). Male sex increased the odds (95% CI) by 166% (61%-349%).
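These percentages are simply the odds ratios transformed with (OR − 1) × 100. As a quick sketch, assuming the `OR10` and `CI10` objects computed above are still in memory:

```r
# percentage change in the odds per one-unit increase in each predictor
round((cbind(OR10, CI10) - 1) * 100)
```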

# PREDICTIVE ABILITY OF THE FINAL MODEL

# access dplyr and ggplot2
library(dplyr); library(ggplot2)

# fit the model
m10 <- glm(high_use ~ sex + goout + failures + absences, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m10, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% addmargins %>% round(digits = 2)
##         prediction
## high_use FALSE TRUE Sum
##    FALSE   243   16 259
##    TRUE     61   50 111
##    Sum     304   66 370
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins %>% round(digits = 2)
##         prediction
## high_use FALSE TRUE  Sum
##    FALSE  0.66 0.04 0.70
##    TRUE   0.16 0.14 0.30
##    Sum    0.82 0.18 1.00
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = high_use, y = probability, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the proportion of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0) %>% round(digits = 2)
## [1] 0.3
loss_func(class = alc$high_use, prob = 1) %>% round(digits = 2)
## [1] 0.7
loss_func(class = alc$high_use, prob = alc$probability) %>% round(digits = 2)
## [1] 0.21

The overall accuracy of the model (i.e., correct classification) is high: 293 participants (79%) are correctly classified. There is not much difference in the prediction between the user levels: of those predicted to be low users of alcohol, 243 (80%) are correctly classified, whereas of those predicted to be high users, 50 (76%) are truly high users.

The model does improve predictive performance compared with simple guessing: 21% of participants are misclassified by the model, whereas 30% would be misclassified by guessing that the probability of high use is 0 for everyone, and 70% by guessing that it is 1 for everyone.
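These rates can also be computed directly from the cross-tabulation. A minimal sketch, assuming `alc` still contains the `prediction` column created above:

```r
# confusion matrix of observed versus predicted high use
cm <- table(high_use = alc$high_use, prediction = alc$prediction)

# overall accuracy: proportion of all correct classifications
accuracy <- sum(diag(cm)) / sum(cm)

# proportion of correct classifications within each predicted class
correct_among_pred_low  <- cm["FALSE", "FALSE"] / sum(cm[, "FALSE"])
correct_among_pred_high <- cm["TRUE", "TRUE"]   / sum(cm[, "TRUE"])

round(c(accuracy  = accuracy,
        pred_low  = correct_among_pred_low,
        pred_high = correct_among_pred_high), digits = 2)
```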

# BONUS SECTION: ON CROSS-VALIDATION

# K-fold cross-validation (here K = nrow(alc) / 10 = 37 folds)
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m10, K = nrow(alc) / 10)

# average proportion of wrong predictions in the cross-validation
cv$delta[1]
## [1] 0.2135135

The final model has a slightly better test-set performance: its cross-validation error is 0.21, compared with the respective error of 0.26 produced by the model in the Exercise Set.
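For comparison, the more conventional 10-fold cross-validation is obtained by fixing K = 10. A sketch (the fold assignment is random, so the error varies slightly between runs unless a seed is set):

```r
library(boot)
set.seed(2022)  # for a reproducible fold assignment
cv10 <- cv.glm(data = alc, cost = loss_func, glmfit = m10, K = 10)
cv10$delta[1]  # average proportion of wrong predictions
```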

# SUPER-BONUS SECTION: CROSS-VALIDATION OF MODELS OF DIFFERENT COMPLEXITY

# I'm sure there is a much more elegant way of writing this R code...
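One more compact alternative to the manual elimination below is AIC-based backward selection with `step()`. A sketch, noting that AIC-based elimination need not drop terms in the same order, or end with the same model, as manual p-value-based elimination:

```r
# fit the full model and let step() perform backward elimination by AIC
m_full <- glm(high_use ~ age + sex + health + address + famsize + Pstatus +
                Fedu + Mjob + guardian + school + reason + schoolsup + famsup +
                paid + activities + nursery + higher + internet + romantic +
                famrel + studytime + traveltime + freetime + goout + failures +
                absences + G3,
              data = alc, family = "binomial")
m_step <- step(m_full, direction = "backward", trace = 0)
summary(m_step)
```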

# different models and their summaries
m21 <- glm(high_use ~ age + sex + health + address + famsize + Pstatus + Fedu + Mjob + guardian + school + reason + schoolsup + famsup + paid + activities + nursery + higher + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m21)
## 
## Call:
## glm(formula = high_use ~ age + sex + health + address + famsize + 
##     Pstatus + Fedu + Mjob + guardian + school + reason + schoolsup + 
##     famsup + paid + activities + nursery + higher + internet + 
##     romantic + famrel + studytime + traveltime + freetime + goout + 
##     failures + absences + G3, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7733  -0.6942  -0.3703   0.5275   2.8698  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.06934    2.95252  -1.040 0.298541    
## age               0.05401    0.14774   0.366 0.714696    
## sexM              0.86707    0.33018   2.626 0.008639 ** 
## health            0.19978    0.11237   1.778 0.075433 .  
## addressU         -0.85388    0.41398  -2.063 0.039149 *  
## famsizeLE3        0.50174    0.32064   1.565 0.117621    
## PstatusT         -0.21412    0.51220  -0.418 0.675910    
## Fedu              0.04477    0.15749   0.284 0.776218    
## Mjobhealth       -0.86921    0.66777  -1.302 0.193033    
## Mjobother        -0.76813    0.46827  -1.640 0.100929    
## Mjobservices     -0.65342    0.51502  -1.269 0.204540    
## Mjobteacher      -0.36452    0.57503  -0.634 0.526139    
## guardianmother   -0.81573    0.36120  -2.258 0.023921 *  
## guardianother    -0.02361    0.77809  -0.030 0.975790    
## schoolMS         -0.41859    0.55098  -0.760 0.447418    
## reasonhome        0.43518    0.38491   1.131 0.258223    
## reasonother       1.33386    0.52127   2.559 0.010501 *  
## reasonreputation -0.05167    0.40177  -0.129 0.897677    
## schoolsupyes      0.07500    0.45298   0.166 0.868493    
## famsupyes        -0.15239    0.32301  -0.472 0.637080    
## paidyes           0.64657    0.31745   2.037 0.041673 *  
## activitiesyes    -0.54344    0.30862  -1.761 0.078264 .  
## nurseryyes       -0.46789    0.36303  -1.289 0.197453    
## higheryes        -0.17828    0.71758  -0.248 0.803787    
## internetyes       0.21525    0.44485   0.484 0.628467    
## romanticyes      -0.53828    0.32531  -1.655 0.097993 .  
## famrel           -0.52687    0.16561  -3.181 0.001465 ** 
## studytime        -0.28025    0.20271  -1.382 0.166824    
## traveltime        0.25226    0.22899   1.102 0.270621    
## freetime          0.23073    0.16392   1.408 0.159248    
## goout             0.87900    0.15327   5.735 9.75e-09 ***
## failures          0.29553    0.27415   1.078 0.281042    
## absences          0.09199    0.02614   3.519 0.000433 ***
## G3                0.02725    0.05191   0.525 0.599637    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 318.90  on 336  degrees of freedom
## AIC: 386.9
## 
## Number of Fisher Scoring iterations: 5
m22 <- glm(high_use ~ age + sex + health + address + famsize + Pstatus + Fedu + Mjob + guardian + school + reason + famsup + paid + activities + nursery + higher + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m22)
## 
## Call:
## glm(formula = high_use ~ age + sex + health + address + famsize + 
##     Pstatus + Fedu + Mjob + guardian + school + reason + famsup + 
##     paid + activities + nursery + higher + internet + romantic + 
##     famrel + studytime + traveltime + freetime + goout + failures + 
##     absences + G3, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7609  -0.6940  -0.3720   0.5267   2.8647  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.96925    2.89116  -1.027 0.304415    
## age               0.04918    0.14482   0.340 0.734150    
## sexM              0.85938    0.32682   2.630 0.008551 ** 
## health            0.19904    0.11228   1.773 0.076282 .  
## addressU         -0.84988    0.41350  -2.055 0.039849 *  
## famsizeLE3        0.49997    0.32044   1.560 0.118693    
## PstatusT         -0.21693    0.51219  -0.424 0.671902    
## Fedu              0.04554    0.15741   0.289 0.772346    
## Mjobhealth       -0.87549    0.66660  -1.313 0.189056    
## Mjobother        -0.77016    0.46819  -1.645 0.099975 .  
## Mjobservices     -0.65192    0.51482  -1.266 0.205401    
## Mjobteacher      -0.36879    0.57448  -0.642 0.520900    
## guardianmother   -0.81692    0.36120  -2.262 0.023718 *  
## guardianother    -0.02329    0.77775  -0.030 0.976110    
## schoolMS         -0.42351    0.55004  -0.770 0.441325    
## reasonhome        0.43646    0.38496   1.134 0.256889    
## reasonother       1.33457    0.52098   2.562 0.010417 *  
## reasonreputation -0.05037    0.40179  -0.125 0.900245    
## famsupyes        -0.15014    0.32279  -0.465 0.641833    
## paidyes           0.64340    0.31691   2.030 0.042334 *  
## activitiesyes    -0.53930    0.30764  -1.753 0.079599 .  
## nurseryyes       -0.46621    0.36269  -1.285 0.198650    
## higheryes        -0.17380    0.71753  -0.242 0.808610    
## internetyes       0.21588    0.44490   0.485 0.627516    
## romanticyes      -0.53981    0.32514  -1.660 0.096873 .  
## famrel           -0.52685    0.16560  -3.181 0.001465 ** 
## studytime        -0.27968    0.20274  -1.380 0.167738    
## traveltime        0.25567    0.22830   1.120 0.262772    
## freetime          0.23063    0.16394   1.407 0.159502    
## goout             0.87846    0.15325   5.732  9.9e-09 ***
## failures          0.29549    0.27438   1.077 0.281504    
## absences          0.09192    0.02612   3.519 0.000433 ***
## G3                0.02598    0.05130   0.506 0.612585    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 318.93  on 337  degrees of freedom
## AIC: 384.93
## 
## Number of Fisher Scoring iterations: 5
m23 <- glm(high_use ~ age + sex + health + address + famsize + Pstatus + Fedu + Mjob + guardian + school + reason + famsup + paid + activities + nursery + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m23)
## 
## Call:
## glm(formula = high_use ~ age + sex + health + address + famsize + 
##     Pstatus + Fedu + Mjob + guardian + school + reason + famsup + 
##     paid + activities + nursery + internet + romantic + famrel + 
##     studytime + traveltime + freetime + goout + failures + absences + 
##     G3, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7885  -0.6919  -0.3695   0.5206   2.8581  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.19543    2.73368  -1.169 0.242440    
## age               0.05587    0.14190   0.394 0.693793    
## sexM              0.86745    0.32512   2.668 0.007629 ** 
## health            0.19870    0.11227   1.770 0.076750 .  
## addressU         -0.85349    0.41317  -2.066 0.038858 *  
## famsizeLE3        0.49873    0.32026   1.557 0.119410    
## PstatusT         -0.21809    0.51176  -0.426 0.669994    
## Fedu              0.04169    0.15666   0.266 0.790163    
## Mjobhealth       -0.88016    0.66551  -1.323 0.185988    
## Mjobother        -0.77519    0.46717  -1.659 0.097050 .  
## Mjobservices     -0.65521    0.51392  -1.275 0.202341    
## Mjobteacher      -0.37101    0.57428  -0.646 0.518248    
## guardianmother   -0.81986    0.36078  -2.272 0.023060 *  
## guardianother    -0.03054    0.77805  -0.039 0.968686    
## schoolMS         -0.43916    0.54672  -0.803 0.421826    
## reasonhome        0.43014    0.38412   1.120 0.262790    
## reasonother       1.33275    0.52008   2.563 0.010389 *  
## reasonreputation -0.05181    0.40164  -0.129 0.897366    
## famsupyes        -0.15501    0.32206  -0.481 0.630290    
## paidyes           0.63438    0.31470   2.016 0.043817 *  
## activitiesyes    -0.54909    0.30503  -1.800 0.071840 .  
## nurseryyes       -0.46197    0.36200  -1.276 0.201902    
## internetyes       0.22716    0.44252   0.513 0.607727    
## romanticyes      -0.52965    0.32216  -1.644 0.100163    
## famrel           -0.52666    0.16560  -3.180 0.001471 ** 
## studytime        -0.28176    0.20239  -1.392 0.163879    
## traveltime        0.25472    0.22814   1.117 0.264205    
## freetime          0.22989    0.16397   1.402 0.160909    
## goout             0.87855    0.15327   5.732 9.93e-09 ***
## failures          0.30386    0.27304   1.113 0.265750    
## absences          0.09209    0.02602   3.540 0.000401 ***
## G3                0.02346    0.05022   0.467 0.640372    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 318.98  on 338  degrees of freedom
## AIC: 382.98
## 
## Number of Fisher Scoring iterations: 5
m24 <- glm(high_use ~ age + sex + health + address + famsize + Pstatus + Mjob + guardian + school + reason + famsup + paid + activities + nursery + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m24)
## 
## Call:
## glm(formula = high_use ~ age + sex + health + address + famsize + 
##     Pstatus + Mjob + guardian + school + reason + famsup + paid + 
##     activities + nursery + internet + romantic + famrel + studytime + 
##     traveltime + freetime + goout + failures + absences + G3, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8024  -0.6867  -0.3726   0.5270   2.8586  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.05049    2.67834  -1.139 0.254725    
## age               0.05400    0.14173   0.381 0.703172    
## sexM              0.86914    0.32518   2.673 0.007522 ** 
## health            0.19961    0.11211   1.781 0.074992 .  
## addressU         -0.85786    0.41243  -2.080 0.037526 *  
## famsizeLE3        0.48881    0.31803   1.537 0.124296    
## PstatusT         -0.23584    0.50787  -0.464 0.642377    
## Mjobhealth       -0.86508    0.66285  -1.305 0.191865    
## Mjobother        -0.78793    0.46436  -1.697 0.089737 .  
## Mjobservices     -0.65306    0.51381  -1.271 0.203720    
## Mjobteacher      -0.33772    0.56006  -0.603 0.546508    
## guardianmother   -0.84113    0.35206  -2.389 0.016887 *  
## guardianother    -0.03532    0.77788  -0.045 0.963782    
## schoolMS         -0.43365    0.54688  -0.793 0.427804    
## reasonhome        0.43122    0.38402   1.123 0.261480    
## reasonother       1.33397    0.52063   2.562 0.010400 *  
## reasonreputation -0.04724    0.40116  -0.118 0.906266    
## famsupyes        -0.14646    0.32032  -0.457 0.647509    
## paidyes           0.63330    0.31460   2.013 0.044111 *  
## activitiesyes    -0.54591    0.30486  -1.791 0.073341 .  
## nurseryyes       -0.44785    0.35820  -1.250 0.211199    
## internetyes       0.23102    0.44189   0.523 0.601110    
## romanticyes      -0.52520    0.32146  -1.634 0.102303    
## famrel           -0.52726    0.16571  -3.182 0.001463 ** 
## studytime        -0.28577    0.20165  -1.417 0.156434    
## traveltime        0.24871    0.22696   1.096 0.273141    
## freetime          0.22620    0.16335   1.385 0.166121    
## goout             0.88397    0.15201   5.815 6.06e-09 ***
## failures          0.28968    0.26763   1.082 0.279066    
## absences          0.09278    0.02592   3.579 0.000345 ***
## G3                0.02455    0.05001   0.491 0.623489    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 319.05  on 339  degrees of freedom
## AIC: 381.05
## 
## Number of Fisher Scoring iterations: 5
m25 <- glm(high_use ~ sex + health + address + famsize + Pstatus + Mjob + guardian + school + reason + famsup + paid + activities + nursery + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m25)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Pstatus + 
##     Mjob + guardian + school + reason + famsup + paid + activities + 
##     nursery + internet + romantic + famrel + studytime + traveltime + 
##     freetime + goout + failures + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8115  -0.6780  -0.3741   0.5351   2.8662  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.18345    1.40586  -1.553 0.120398    
## sexM              0.87542    0.32423   2.700 0.006935 ** 
## health            0.19812    0.11194   1.770 0.076744 .  
## addressU         -0.87032    0.41091  -2.118 0.034173 *  
## famsizeLE3        0.48549    0.31783   1.528 0.126636    
## PstatusT         -0.23136    0.50751  -0.456 0.648477    
## Mjobhealth       -0.86599    0.66254  -1.307 0.191187    
## Mjobother        -0.78347    0.46470  -1.686 0.091799 .  
## Mjobservices     -0.65478    0.51433  -1.273 0.202993    
## Mjobteacher      -0.33410    0.55988  -0.597 0.550689    
## guardianmother   -0.83727    0.35200  -2.379 0.017378 *  
## guardianother     0.03273    0.75903   0.043 0.965609    
## schoolMS         -0.36884    0.52019  -0.709 0.478288    
## reasonhome        0.42938    0.38416   1.118 0.263691    
## reasonother       1.31806    0.51857   2.542 0.011030 *  
## reasonreputation -0.04819    0.40091  -0.120 0.904328    
## famsupyes        -0.15410    0.31946  -0.482 0.629540    
## paidyes           0.63749    0.31432   2.028 0.042544 *  
## activitiesyes    -0.54843    0.30457  -1.801 0.071751 .  
## nurseryyes       -0.43975    0.35738  -1.231 0.218509    
## internetyes       0.22314    0.44134   0.506 0.613142    
## romanticyes      -0.50541    0.31711  -1.594 0.110986    
## famrel           -0.52404    0.16539  -3.169 0.001532 ** 
## studytime        -0.28015    0.20040  -1.398 0.162138    
## traveltime        0.24680    0.22724   1.086 0.277444    
## freetime          0.22240    0.16299   1.365 0.172397    
## goout             0.88971    0.15127   5.881 4.07e-09 ***
## failures          0.30230    0.26479   1.142 0.253582    
## absences          0.09363    0.02576   3.635 0.000278 ***
## G3                0.02386    0.05003   0.477 0.633404    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 319.20  on 340  degrees of freedom
## AIC: 379.2
## 
## Number of Fisher Scoring iterations: 5
m26 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + school + reason + famsup + paid + activities + nursery + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m26)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob + 
##     guardian + school + reason + famsup + paid + activities + 
##     nursery + internet + romantic + famrel + studytime + traveltime + 
##     freetime + goout + failures + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8257  -0.6871  -0.3759   0.5287   2.8577  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.39445    1.32839  -1.803 0.071465 .  
## sexM              0.87214    0.32419   2.690 0.007141 ** 
## health            0.19659    0.11203   1.755 0.079284 .  
## addressU         -0.87215    0.41028  -2.126 0.033526 *  
## famsizeLE3        0.49404    0.31668   1.560 0.118747    
## Mjobhealth       -0.85441    0.66232  -1.290 0.197039    
## Mjobother        -0.75819    0.46117  -1.644 0.100168    
## Mjobservices     -0.63041    0.51137  -1.233 0.217651    
## Mjobteacher      -0.31297    0.55700  -0.562 0.574195    
## guardianmother   -0.82015    0.34963  -2.346 0.018987 *  
## guardianother     0.07088    0.76004   0.093 0.925699    
## schoolMS         -0.37703    0.51952  -0.726 0.468011    
## reasonhome        0.42329    0.38397   1.102 0.270289    
## reasonother       1.31565    0.51775   2.541 0.011051 *  
## reasonreputation -0.04565    0.40123  -0.114 0.909410    
## famsupyes        -0.15297    0.31920  -0.479 0.631787    
## paidyes           0.62587    0.31306   1.999 0.045585 *  
## activitiesyes    -0.55798    0.30363  -1.838 0.066104 .  
## nurseryyes       -0.42608    0.35608  -1.197 0.231472    
## internetyes       0.20904    0.44029   0.475 0.634943    
## romanticyes      -0.50246    0.31704  -1.585 0.113007    
## famrel           -0.52594    0.16579  -3.172 0.001512 ** 
## studytime        -0.28026    0.20035  -1.399 0.161869    
## traveltime        0.24293    0.22631   1.073 0.283073    
## freetime          0.21494    0.16181   1.328 0.184067    
## goout             0.88814    0.15097   5.883 4.04e-09 ***
## failures          0.30053    0.26510   1.134 0.256940    
## absences          0.09458    0.02555   3.702 0.000214 ***
## G3                0.02611    0.04981   0.524 0.600062    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 319.40  on 341  degrees of freedom
## AIC: 377.4
## 
## Number of Fisher Scoring iterations: 5
m27 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + school + reason + famsup + paid + activities + nursery + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m27)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob + 
##     guardian + school + reason + famsup + paid + activities + 
##     nursery + romantic + famrel + studytime + traveltime + freetime + 
##     goout + failures + absences + G3, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8328  -0.6796  -0.3747   0.5404   2.8683  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.32240    1.31885  -1.761 0.078251 .  
## sexM              0.87340    0.32425   2.694 0.007068 ** 
## health            0.19373    0.11162   1.736 0.082649 .  
## addressU         -0.83448    0.40234  -2.074 0.038073 *  
## famsizeLE3        0.49752    0.31629   1.573 0.115724    
## Mjobhealth       -0.79381    0.65218  -1.217 0.223542    
## Mjobother        -0.71259    0.45084  -1.581 0.113970    
## Mjobservices     -0.57330    0.49752  -1.152 0.249191    
## Mjobteacher      -0.25368    0.54375  -0.467 0.640826    
## guardianmother   -0.82837    0.34921  -2.372 0.017686 *  
## guardianother     0.05555    0.75518   0.074 0.941360    
## schoolMS         -0.37018    0.51814  -0.714 0.474950    
## reasonhome        0.41920    0.38397   1.092 0.274942    
## reasonother       1.31310    0.51752   2.537 0.011172 *  
## reasonreputation -0.04511    0.40044  -0.113 0.910317    
## famsupyes        -0.14499    0.31820  -0.456 0.648633    
## paidyes           0.63651    0.31213   2.039 0.041427 *  
## activitiesyes    -0.54865    0.30311  -1.810 0.070282 .  
## nurseryyes       -0.44357    0.35358  -1.255 0.209647    
## romanticyes      -0.49066    0.31597  -1.553 0.120454    
## famrel           -0.52478    0.16579  -3.165 0.001549 ** 
## studytime        -0.27793    0.20027  -1.388 0.165190    
## traveltime        0.24304    0.22637   1.074 0.282979    
## freetime          0.21735    0.16147   1.346 0.178275    
## goout             0.89168    0.15095   5.907 3.48e-09 ***
## failures          0.29404    0.26526   1.109 0.267637    
## absences          0.09631    0.02527   3.811 0.000139 ***
## G3                0.02666    0.04981   0.535 0.592451    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 319.63  on 342  degrees of freedom
## AIC: 375.63
## 
## Number of Fisher Scoring iterations: 5
m28 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + school + reason + paid + activities + nursery + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m28)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob + 
##     guardian + school + reason + paid + activities + nursery + 
##     romantic + famrel + studytime + traveltime + freetime + goout + 
##     failures + absences + G3, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8383  -0.6862  -0.3765   0.5306   2.8548  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.38141    1.31082  -1.817 0.069258 .  
## sexM              0.89197    0.32200   2.770 0.005605 ** 
## health            0.18934    0.11099   1.706 0.088019 .  
## addressU         -0.82360    0.40107  -2.054 0.040024 *  
## famsizeLE3        0.50629    0.31542   1.605 0.108462    
## Mjobhealth       -0.83338    0.64562  -1.291 0.196763    
## Mjobother        -0.70548    0.45022  -1.567 0.117122    
## Mjobservices     -0.58517    0.49658  -1.178 0.238640    
## Mjobteacher      -0.27326    0.54111  -0.505 0.613565    
## guardianmother   -0.82408    0.34927  -2.359 0.018302 *  
## guardianother     0.03067    0.75544   0.041 0.967611    
## schoolMS         -0.33536    0.51293  -0.654 0.513234    
## reasonhome        0.41271    0.38353   1.076 0.281888    
## reasonother       1.32527    0.51650   2.566 0.010292 *  
## reasonreputation -0.04696    0.39985  -0.117 0.906507    
## paidyes           0.60780    0.30520   1.991 0.046429 *  
## activitiesyes    -0.54458    0.30288  -1.798 0.072176 .  
## nurseryyes       -0.44167    0.35364  -1.249 0.211699    
## romanticyes      -0.49720    0.31594  -1.574 0.115551    
## famrel           -0.52100    0.16548  -3.148 0.001642 ** 
## studytime        -0.28997    0.19876  -1.459 0.144587    
## traveltime        0.24073    0.22604   1.065 0.286887    
## freetime          0.21119    0.16069   1.314 0.188762    
## goout             0.89196    0.15073   5.918 3.27e-09 ***
## failures          0.29978    0.26516   1.131 0.258244    
## absences          0.09569    0.02512   3.810 0.000139 ***
## G3                0.02776    0.04978   0.558 0.577025    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 319.84  on 343  degrees of freedom
## AIC: 373.84
## 
## Number of Fisher Scoring iterations: 5
m29 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + school + reason + paid + activities + nursery + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences, data = alc, family = "binomial")
summary(m29)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob + 
##     guardian + school + reason + paid + activities + nursery + 
##     romantic + famrel + studytime + traveltime + freetime + goout + 
##     failures + absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8834  -0.6840  -0.3776   0.5414   2.8104  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.05408    1.16884  -1.757 0.078855 .  
## sexM              0.88825    0.32061   2.771 0.005597 ** 
## health            0.18175    0.11005   1.652 0.098633 .  
## addressU         -0.80345    0.39831  -2.017 0.043681 *  
## famsizeLE3        0.52116    0.31421   1.659 0.097196 .  
## Mjobhealth       -0.78493    0.63846  -1.229 0.218919    
## Mjobother        -0.69892    0.44936  -1.555 0.119860    
## Mjobservices     -0.54356    0.48975  -1.110 0.267057    
## Mjobteacher      -0.25304    0.53837  -0.470 0.638348    
## guardianmother   -0.82071    0.34918  -2.350 0.018755 *  
## guardianother     0.01112    0.74944   0.015 0.988160    
## schoolMS         -0.35999    0.51109  -0.704 0.481214    
## reasonhome        0.41977    0.38309   1.096 0.273185    
## reasonother       1.32112    0.51711   2.555 0.010624 *  
## reasonreputation -0.04785    0.39908  -0.120 0.904569    
## paidyes           0.61105    0.30511   2.003 0.045206 *  
## activitiesyes    -0.53253    0.30174  -1.765 0.077585 .  
## nurseryyes       -0.44481    0.35308  -1.260 0.207738    
## romanticyes      -0.50890    0.31513  -1.615 0.106340    
## famrel           -0.51743    0.16515  -3.133 0.001730 ** 
## studytime        -0.27759    0.19682  -1.410 0.158435    
## traveltime        0.23358    0.22496   1.038 0.299133    
## freetime          0.21474    0.16047   1.338 0.180840    
## goout             0.87700    0.14758   5.943 2.81e-09 ***
## failures          0.25908    0.25368   1.021 0.307102    
## absences          0.09469    0.02499   3.790 0.000151 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 320.15  on 344  degrees of freedom
## AIC: 372.15
## 
## Number of Fisher Scoring iterations: 5
m30 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + nursery + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences, data = alc, family = "binomial")
summary(m30)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob + 
##     guardian + reason + paid + activities + nursery + romantic + 
##     famrel + studytime + traveltime + freetime + goout + failures + 
##     absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8839  -0.6898  -0.3775   0.5367   2.8255  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.17730    1.15541  -1.884 0.059505 .  
## sexM              0.89525    0.31999   2.798 0.005146 ** 
## health            0.19003    0.10953   1.735 0.082761 .  
## addressU         -0.72544    0.38358  -1.891 0.058593 .  
## famsizeLE3        0.51249    0.31324   1.636 0.101822    
## Mjobhealth       -0.77546    0.63678  -1.218 0.223304    
## Mjobother        -0.70399    0.44936  -1.567 0.117195    
## Mjobservices     -0.54322    0.48904  -1.111 0.266656    
## Mjobteacher      -0.27495    0.53673  -0.512 0.608467    
## guardianmother   -0.79620    0.34737  -2.292 0.021900 *  
## guardianother     0.02925    0.75311   0.039 0.969021    
## reasonhome        0.40747    0.38255   1.065 0.286817    
## reasonother       1.26791    0.51068   2.483 0.013035 *  
## reasonreputation -0.02835    0.39735  -0.071 0.943126    
## paidyes           0.60009    0.30427   1.972 0.048585 *  
## activitiesyes    -0.51400    0.30017  -1.712 0.086831 .  
## nurseryyes       -0.43167    0.35401  -1.219 0.222709    
## romanticyes      -0.52006    0.31407  -1.656 0.097747 .  
## famrel           -0.51113    0.16409  -3.115 0.001840 ** 
## studytime        -0.26663    0.19567  -1.363 0.173005    
## traveltime        0.20841    0.22109   0.943 0.345871    
## freetime          0.20529    0.15996   1.283 0.199357    
## goout             0.87369    0.14741   5.927 3.08e-09 ***
## failures          0.27010    0.25267   1.069 0.285069    
## absences          0.09587    0.02486   3.857 0.000115 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 320.66  on 345  degrees of freedom
## AIC: 370.66
## 
## Number of Fisher Scoring iterations: 5
m31 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + nursery + romantic + famrel + studytime + freetime + goout + failures + absences, data = alc, family = "binomial")
summary(m31)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob + 
##     guardian + reason + paid + activities + nursery + romantic + 
##     famrel + studytime + freetime + goout + failures + absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9201  -0.6693  -0.3788   0.6131   2.8134  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.73261    1.04989  -1.650 0.098888 .  
## sexM              0.90570    0.32041   2.827 0.004704 ** 
## health            0.19158    0.10943   1.751 0.080000 .  
## addressU         -0.85025    0.35852  -2.372 0.017714 *  
## famsizeLE3        0.54170    0.31175   1.738 0.082274 .  
## Mjobhealth       -0.82277    0.63458  -1.297 0.194785    
## Mjobother        -0.71149    0.44970  -1.582 0.113618    
## Mjobservices     -0.54651    0.48736  -1.121 0.262138    
## Mjobteacher      -0.30889    0.53811  -0.574 0.565951    
## guardianmother   -0.82702    0.34479  -2.399 0.016456 *  
## guardianother     0.09127    0.74590   0.122 0.902610    
## reasonhome        0.37498    0.38059   0.985 0.324487    
## reasonother       1.21870    0.50695   2.404 0.016218 *  
## reasonreputation -0.05788    0.39576  -0.146 0.883733    
## paidyes           0.61034    0.30451   2.004 0.045036 *  
## activitiesyes    -0.51451    0.30001  -1.715 0.086355 .  
## nurseryyes       -0.43764    0.35353  -1.238 0.215742    
## romanticyes      -0.51168    0.31366  -1.631 0.102818    
## famrel           -0.50391    0.16334  -3.085 0.002035 ** 
## studytime        -0.28187    0.19527  -1.444 0.148876    
## freetime          0.19340    0.15912   1.215 0.224208    
## goout             0.88148    0.14680   6.004 1.92e-09 ***
## failures          0.28102    0.25169   1.117 0.264191    
## absences          0.09545    0.02474   3.859 0.000114 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 321.54  on 346  degrees of freedom
## AIC: 369.54
## 
## Number of Fisher Scoring iterations: 5
m32 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + nursery + romantic + famrel + studytime + freetime + goout + absences, data = alc, family = "binomial")
summary(m32)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob + 
##     guardian + reason + paid + activities + nursery + romantic + 
##     famrel + studytime + freetime + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8364  -0.6916  -0.3742   0.6186   2.8205  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.62838    1.03909  -1.567  0.11709    
## sexM              0.92409    0.31965   2.891  0.00384 ** 
## health            0.20762    0.10799   1.923  0.05453 .  
## addressU         -0.86632    0.35739  -2.424  0.01535 *  
## famsizeLE3        0.54741    0.31159   1.757  0.07895 .  
## Mjobhealth       -0.91185    0.63436  -1.437  0.15060    
## Mjobother        -0.75304    0.44670  -1.686  0.09184 .  
## Mjobservices     -0.56842    0.48561  -1.171  0.24179    
## Mjobteacher      -0.41462    0.53144  -0.780  0.43528    
## guardianmother   -0.81293    0.34396  -2.363  0.01811 *  
## guardianother     0.21191    0.73660   0.288  0.77359    
## reasonhome        0.37154    0.37997   0.978  0.32817    
## reasonother       1.21736    0.50496   2.411  0.01592 *  
## reasonreputation -0.08370    0.39379  -0.213  0.83167    
## paidyes           0.57695    0.30272   1.906  0.05666 .  
## activitiesyes    -0.52457    0.29905  -1.754  0.07941 .  
## nurseryyes       -0.42051    0.35095  -1.198  0.23084    
## romanticyes      -0.50262    0.31226  -1.610  0.10748    
## famrel           -0.52456    0.16165  -3.245  0.00117 ** 
## studytime        -0.31888    0.19250  -1.657  0.09762 .  
## freetime          0.19459    0.15863   1.227  0.21994    
## goout             0.90675    0.14581   6.219 5.01e-10 ***
## absences          0.09679    0.02472   3.916 8.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 322.81  on 347  degrees of freedom
## AIC: 368.81
## 
## Number of Fisher Scoring iterations: 5
m33 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + romantic + famrel + studytime + freetime + goout + absences, data = alc, family = "binomial")
summary(m33)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob + 
##     guardian + reason + paid + activities + romantic + famrel + 
##     studytime + freetime + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8594  -0.7167  -0.3836   0.6146   2.7734  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.81020    1.02798  -1.761  0.07825 .  
## sexM              0.90662    0.31905   2.842  0.00449 ** 
## health            0.20825    0.10722   1.942  0.05210 .  
## addressU         -0.88950    0.35733  -2.489  0.01280 *  
## famsizeLE3        0.49667    0.30835   1.611  0.10723    
## Mjobhealth       -0.99444    0.63260  -1.572  0.11595    
## Mjobother        -0.73015    0.44639  -1.636  0.10191    
## Mjobservices     -0.56579    0.48261  -1.172  0.24105    
## Mjobteacher      -0.45608    0.52878  -0.863  0.38840    
## guardianmother   -0.83220    0.34354  -2.422  0.01542 *  
## guardianother     0.30055    0.73749   0.408  0.68362    
## reasonhome        0.36564    0.37990   0.962  0.33581    
## reasonother       1.20614    0.50090   2.408  0.01604 *  
## reasonreputation -0.08088    0.39255  -0.206  0.83676    
## paidyes           0.56829    0.30209   1.881  0.05995 .  
## activitiesyes    -0.50589    0.29824  -1.696  0.08984 .  
## romanticyes      -0.49832    0.31073  -1.604  0.10879    
## famrel           -0.52072    0.16068  -3.241  0.00119 ** 
## studytime        -0.34536    0.19110  -1.807  0.07073 .  
## freetime          0.18664    0.15763   1.184  0.23639    
## goout             0.89963    0.14511   6.200 5.65e-10 ***
## absences          0.09505    0.02457   3.868  0.00011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 324.23  on 348  degrees of freedom
## AIC: 368.23
## 
## Number of Fisher Scoring iterations: 5
m34 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + romantic + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m34)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob + 
##     guardian + reason + paid + activities + romantic + famrel + 
##     studytime + goout + absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8798  -0.7229  -0.3861   0.6079   2.7816  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.42394    0.96719  -1.472  0.14096    
## sexM              0.94987    0.31747   2.992  0.00277 ** 
## health            0.20524    0.10683   1.921  0.05472 .  
## addressU         -0.86236    0.35600  -2.422  0.01542 *  
## famsizeLE3        0.48070    0.30752   1.563  0.11802    
## Mjobhealth       -0.98267    0.62987  -1.560  0.11873    
## Mjobother        -0.69956    0.44378  -1.576  0.11494    
## Mjobservices     -0.52858    0.47796  -1.106  0.26877    
## Mjobteacher      -0.41155    0.52733  -0.780  0.43513    
## guardianmother   -0.84478    0.34325  -2.461  0.01385 *  
## guardianother     0.37225    0.73463   0.507  0.61235    
## reasonhome        0.30688    0.37404   0.820  0.41196    
## reasonother       1.17332    0.49988   2.347  0.01891 *  
## reasonreputation -0.11529    0.39057  -0.295  0.76786    
## paidyes           0.56116    0.30094   1.865  0.06222 .  
## activitiesyes    -0.47737    0.29595  -1.613  0.10674    
## romanticyes      -0.46686    0.30805  -1.516  0.12964    
## famrel           -0.50323    0.15945  -3.156  0.00160 ** 
## studytime        -0.35054    0.19051  -1.840  0.06576 .  
## goout             0.94273    0.14165   6.655 2.83e-11 ***
## absences          0.09338    0.02452   3.807  0.00014 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 325.64  on 349  degrees of freedom
## AIC: 367.64
## 
## Number of Fisher Scoring iterations: 5
m35 <- glm(high_use ~ sex + health + address + famsize + guardian + reason + paid + activities + romantic + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m35)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + guardian + 
##     reason + paid + activities + romantic + famrel + studytime + 
##     goout + absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8674  -0.7171  -0.4029   0.5821   2.7665  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.72681    0.94633  -1.825 0.068041 .  
## sexM              0.91034    0.30568   2.978 0.002901 ** 
## health            0.17681    0.10399   1.700 0.089096 .  
## addressU         -0.94771    0.34320  -2.761 0.005756 ** 
## famsizeLE3        0.45183    0.30426   1.485 0.137536    
## guardianmother   -0.75088    0.33290  -2.256 0.024096 *  
## guardianother     0.40855    0.73349   0.557 0.577533    
## reasonhome        0.20790    0.36585   0.568 0.569858    
## reasonother       1.02210    0.48846   2.093 0.036393 *  
## reasonreputation -0.27714    0.37752  -0.734 0.462877    
## paidyes           0.54838    0.29252   1.875 0.060834 .  
## activitiesyes    -0.46361    0.28927  -1.603 0.109002    
## romanticyes      -0.45457    0.30736  -1.479 0.139160    
## famrel           -0.48800    0.15813  -3.086 0.002028 ** 
## studytime        -0.32528    0.18830  -1.727 0.084081 .  
## goout             0.90858    0.13835   6.567 5.13e-11 ***
## absences          0.09174    0.02432   3.773 0.000161 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 329.01  on 353  degrees of freedom
## AIC: 363.01
## 
## Number of Fisher Scoring iterations: 5
m36 <- glm(high_use ~ sex + health + address + guardian + reason + paid + activities + romantic + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m36)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + guardian + 
##     reason + paid + activities + romantic + famrel + studytime + 
##     goout + absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9142  -0.7135  -0.4238   0.6314   2.7314  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.57319    0.93900  -1.675 0.093858 .  
## sexM              0.94921    0.30255   3.137 0.001705 ** 
## health            0.16840    0.10362   1.625 0.104131    
## addressU         -0.88181    0.33899  -2.601 0.009288 ** 
## guardianmother   -0.69988    0.32862  -2.130 0.033194 *  
## guardianother     0.40360    0.72237   0.559 0.576358    
## reasonhome        0.17061    0.36325   0.470 0.638586    
## reasonother       0.97289    0.48759   1.995 0.046011 *  
## reasonreputation -0.28239    0.37686  -0.749 0.453655    
## paidyes           0.54303    0.29151   1.863 0.062486 .  
## activitiesyes    -0.47048    0.28867  -1.630 0.103146    
## romanticyes      -0.41451    0.30400  -1.363 0.172726    
## famrel           -0.49000    0.15648  -3.131 0.001740 ** 
## studytime        -0.33980    0.18695  -1.818 0.069132 .  
## goout             0.89470    0.13681   6.540 6.17e-11 ***
## absences          0.09182    0.02445   3.755 0.000174 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 331.20  on 354  degrees of freedom
## AIC: 363.2
## 
## Number of Fisher Scoring iterations: 5
m37 <- glm(high_use ~ sex + health + address + guardian + reason + paid + activities + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m37)
## 
## Call:
## glm(formula = high_use ~ sex + health + address + guardian + 
##     reason + paid + activities + famrel + studytime + goout + 
##     absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8365  -0.7167  -0.4191   0.6244   2.7675  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.64430    0.93738  -1.754 0.079404 .  
## sexM              0.95963    0.30116   3.186 0.001440 ** 
## health            0.15499    0.10292   1.506 0.132074    
## addressU         -0.87733    0.33754  -2.599 0.009346 ** 
## guardianmother   -0.69452    0.32728  -2.122 0.033826 *  
## guardianother     0.34130    0.71444   0.478 0.632848    
## reasonhome        0.17125    0.36156   0.474 0.635751    
## reasonother       0.88750    0.48109   1.845 0.065069 .  
## reasonreputation -0.26597    0.37622  -0.707 0.479601    
## paidyes           0.55647    0.29067   1.914 0.055563 .  
## activitiesyes    -0.49094    0.28719  -1.709 0.087364 .  
## famrel           -0.47145    0.15487  -3.044 0.002334 ** 
## studytime        -0.35932    0.18770  -1.914 0.055578 .  
## goout             0.88596    0.13606   6.511 7.45e-11 ***
## absences          0.08737    0.02401   3.639 0.000274 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 333.10  on 355  degrees of freedom
## AIC: 363.1
## 
## Number of Fisher Scoring iterations: 5
m38 <- glm(high_use ~ sex + address + guardian + reason + paid + activities + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m38)
## 
## Call:
## glm(formula = high_use ~ sex + address + guardian + reason + 
##     paid + activities + famrel + studytime + goout + absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9219  -0.7435  -0.4192   0.6018   2.8389  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.12869    0.86885  -1.299 0.193927    
## sexM              1.00904    0.29874   3.378 0.000731 ***
## addressU         -0.89571    0.33781  -2.652 0.008013 ** 
## guardianmother   -0.70751    0.32584  -2.171 0.029904 *  
## guardianother     0.29125    0.71045   0.410 0.681839    
## reasonhome        0.14091    0.35838   0.393 0.694178    
## reasonother       0.86090    0.47814   1.800 0.071783 .  
## reasonreputation -0.33539    0.37241  -0.901 0.367793    
## paidyes           0.52319    0.28763   1.819 0.068918 .  
## activitiesyes    -0.50255    0.28643  -1.755 0.079335 .  
## famrel           -0.43812    0.15217  -2.879 0.003987 ** 
## studytime        -0.35475    0.18676  -1.899 0.057500 .  
## goout             0.87536    0.13595   6.439 1.21e-10 ***
## absences          0.08705    0.02377   3.662 0.000251 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 335.41  on 356  degrees of freedom
## AIC: 363.41
## 
## Number of Fisher Scoring iterations: 5
m39 <- glm(high_use ~ sex + address + guardian + paid + activities + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m39)
## 
## Call:
## glm(formula = high_use ~ sex + address + guardian + paid + activities + 
##     famrel + studytime + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8631  -0.7338  -0.4343   0.6525   2.7038  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.98780    0.84464  -1.169 0.242204    
## sexM            1.01751    0.29384   3.463 0.000535 ***
## addressU       -0.88774    0.32823  -2.705 0.006838 ** 
## guardianmother -0.68091    0.32127  -2.119 0.034056 *  
## guardianother   0.27875    0.68783   0.405 0.685287    
## paidyes         0.58954    0.27992   2.106 0.035195 *  
## activitiesyes  -0.52740    0.27932  -1.888 0.059003 .  
## famrel         -0.41406    0.14977  -2.765 0.005698 ** 
## studytime      -0.42783    0.18304  -2.337 0.019422 *  
## goout           0.84987    0.13236   6.421 1.35e-10 ***
## absences        0.08434    0.02276   3.705 0.000211 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 341.12  on 359  degrees of freedom
## AIC: 363.12
## 
## Number of Fisher Scoring iterations: 5
m40 <- glm(high_use ~ sex + address + paid + activities + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m40)
## 
## Call:
## glm(formula = high_use ~ sex + address + paid + activities + 
##     famrel + studytime + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8947  -0.7350  -0.4209   0.6334   2.8292  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.50769    0.81175  -1.857 0.063263 .  
## sexM           1.03295    0.29075   3.553 0.000381 ***
## addressU      -0.80243    0.32151  -2.496 0.012566 *  
## paidyes        0.55253    0.27681   1.996 0.045928 *  
## activitiesyes -0.50399    0.27583  -1.827 0.067672 .  
## famrel        -0.39748    0.14678  -2.708 0.006769 ** 
## studytime     -0.41410    0.18179  -2.278 0.022734 *  
## goout          0.82299    0.12977   6.342 2.27e-10 ***
## absences       0.07971    0.02254   3.536 0.000406 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 347.06  on 361  degrees of freedom
## AIC: 365.06
## 
## Number of Fisher Scoring iterations: 5
m41 <- glm(high_use ~ sex + address + paid + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m41)
## 
## Call:
## glm(formula = high_use ~ sex + address + paid + famrel + studytime + 
##     goout + absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9387  -0.7500  -0.4471   0.7226   2.7051  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.55596    0.80528  -1.932 0.053334 .  
## sexM         0.94192    0.28379   3.319 0.000903 ***
## addressU    -0.74838    0.31779  -2.355 0.018526 *  
## paidyes      0.54137    0.27503   1.968 0.049027 *  
## famrel      -0.39952    0.14509  -2.754 0.005896 ** 
## studytime   -0.46626    0.18048  -2.583 0.009782 ** 
## goout        0.80085    0.12757   6.278 3.43e-10 ***
## absences     0.07814    0.02253   3.468 0.000524 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 350.45  on 362  degrees of freedom
## AIC: 366.45
## 
## Number of Fisher Scoring iterations: 5
m42 <- glm(high_use ~ sex + address + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m42)
## 
## Call:
## glm(formula = high_use ~ sex + address + famrel + studytime + 
##     goout + absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9143  -0.7377  -0.4386   0.7459   2.5829  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.32310    0.79014  -1.675 0.094030 .  
## sexM         0.88170    0.27864   3.164 0.001555 ** 
## addressU    -0.68822    0.31409  -2.191 0.028439 *  
## famrel      -0.40542    0.14396  -2.816 0.004860 ** 
## studytime   -0.42030    0.17536  -2.397 0.016538 *  
## goout        0.78974    0.12648   6.244 4.27e-10 ***
## absences     0.07432    0.02230   3.332 0.000861 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 354.40  on 363  degrees of freedom
## AIC: 368.4
## 
## Number of Fisher Scoring iterations: 4
m43 <- glm(high_use ~ sex + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m43)
## 
## Call:
## glm(formula = high_use ~ sex + famrel + studytime + goout + absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7673  -0.7725  -0.4525   0.7122   2.6684  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.80705    0.75693  -2.387 0.016970 *  
## sexM         0.90778    0.27500   3.301 0.000963 ***
## famrel      -0.40238    0.14164  -2.841 0.004499 ** 
## studytime   -0.41980    0.17376  -2.416 0.015693 *  
## goout        0.76397    0.12465   6.129 8.85e-10 ***
## absences     0.07613    0.02242   3.396 0.000683 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 359.16  on 364  degrees of freedom
## AIC: 371.16
## 
## Number of Fisher Scoring iterations: 4
m44 <- glm(high_use ~ sex + famrel + goout + absences, data = alc, family = "binomial")
summary(m44)
## 
## Call:
## glm(formula = high_use ~ sex + famrel + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7082  -0.7909  -0.5005   0.7441   2.5673  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.71101    0.66639  -4.068 4.74e-05 ***
## sexM         1.07940    0.26449   4.081 4.48e-05 ***
## famrel      -0.41726    0.14164  -2.946 0.003219 ** 
## goout        0.76974    0.12445   6.185 6.20e-10 ***
## absences     0.08218    0.02227   3.691 0.000224 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 365.31  on 365  degrees of freedom
## AIC: 375.31
## 
## Number of Fisher Scoring iterations: 4
m45 <- glm(high_use ~ sex + goout + absences, data = alc, family = "binomial")
summary(m45)
## 
## Call:
## glm(formula = high_use ~ sex + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8060  -0.8090  -0.5248   0.8214   2.4806  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.18142    0.48085  -8.696  < 2e-16 ***
## sexM         1.02223    0.25946   3.940 8.15e-05 ***
## goout        0.72793    0.12057   6.038 1.56e-09 ***
## absences     0.08478    0.02266   3.741 0.000183 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 374.10  on 366  degrees of freedom
## AIC: 382.1
## 
## Number of Fisher Scoring iterations: 4
m46 <- glm(high_use ~ sex + goout, data = alc, family = "binomial")
summary(m46)
## 
## Call:
## glm(formula = high_use ~ sex + goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5740  -0.8658  -0.6216   0.8272   2.4929  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.8193     0.4572  -8.354  < 2e-16 ***
## sexM          0.9268     0.2507   3.696 0.000219 ***
## goout         0.7578     0.1190   6.369  1.9e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 388.94  on 367  degrees of freedom
## AIC: 394.94
## 
## Number of Fisher Scoring iterations: 4
m47 <- glm(high_use ~ goout, data = alc, family = "binomial")
summary(m47)
## 
## Call:
## glm(formula = high_use ~ goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3712  -0.7687  -0.5470   0.9952   2.3037  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.3368     0.4188  -7.968 1.62e-15 ***
## goout         0.7563     0.1165   6.494 8.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 403.05  on 368  degrees of freedom
## AIC: 407.05
## 
## Number of Fisher Scoring iterations: 4
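The backward elimination above ends with the single-predictor model m47. Instead of reading the AIC off each summary by hand, the AICs of all fitted candidate models can be collected programmatically; a small sketch, assuming the models m21–m47 are still in the workspace as fitted above:

```r
# Collect the AIC of each candidate model and find the smallest
model_names <- paste0("m", 21:47)
aics <- sapply(model_names, function(nm) AIC(get(nm)))
sort(aics)[1:5]          # five best (lowest-AIC) models
names(which.min(aics))   # name of the overall best model
```

The same comparison could also be automated from the full model with `step(m29, direction = "backward")`, which performs this AIC-based elimination in one call.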
# predict() the probabilities of high_use for each model m21..m47,
# add the probabilities to 'alc', and derive TRUE/FALSE predictions
# of high_use at the 0.5 cutoff (equivalent to the repeated
# predict()/mutate() statements, written once as a loop)
for (i in 21:47) {
  prob <- predict(get(paste0("m", i)), type = "response")
  alc[[paste0("prob", i)]] <- prob
  alc[[paste0("pred", i)]] <- prob > 0.5
}

# print column names
colnames(alc)
##  [1] "X"           "school"      "sex"         "age"         "address"    
##  [6] "famsize"     "Pstatus"     "Medu"        "Fedu"        "Mjob"       
## [11] "Fjob"        "reason"      "guardian"    "traveltime"  "studytime"  
## [16] "schoolsup"   "famsup"      "activities"  "nursery"     "higher"     
## [21] "internet"    "romantic"    "famrel"      "freetime"    "goout"      
## [26] "Dalc"        "Walc"        "health"      "failures"    "paid"       
## [31] "absences"    "G1"          "G2"          "G3"          "alc_use"    
## [36] "high_use"    "probability" "prediction"  "prob21"      "prob22"     
## [41] "prob23"      "prob24"      "prob25"      "prob26"      "prob27"     
## [46] "prob28"      "prob29"      "prob30"      "prob31"      "prob32"     
## [51] "prob33"      "prob34"      "prob35"      "prob36"      "prob37"     
## [56] "prob38"      "prob39"      "prob40"      "prob41"      "prob42"     
## [61] "prob43"      "prob44"      "prob45"      "prob46"      "prob47"     
## [66] "pred21"      "pred22"      "pred23"      "pred24"      "pred25"     
## [71] "pred26"      "pred27"      "pred28"      "pred29"      "pred30"     
## [76] "pred31"      "pred32"      "pred33"      "pred34"      "pred35"     
## [81] "pred36"      "pred37"      "pred38"      "pred39"      "pred40"     
## [86] "pred41"      "pred42"      "pred43"      "pred44"      "pred45"     
## [91] "pred46"      "pred47"
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
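As a quick sanity check of the loss function, here is a toy example (my own illustration, not part of the course template; the definition is repeated so the chunk stands alone). TRUE/FALSE classes are coerced to 1/0, so `abs(class - prob) > 0.5` flags exactly the misclassified observations.

```r
# toy illustration of the loss function; the definition is repeated here so
# that the chunk is self-contained
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
toy_class <- c(TRUE, TRUE, FALSE, FALSE)
toy_prob <- c(0.9, 0.2, 0.4, 0.7)  # 2nd and 4th are on the wrong side of 0.5
loss_func(class = toy_class, prob = toy_prob)
## [1] 0.5
```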

# compute the proportion of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$prob21) %>% round(digits = 3)
## [1] 0.189
loss_func(class = alc$high_use, prob = alc$prob22) %>% round(digits = 3)
## [1] 0.186
loss_func(class = alc$high_use, prob = alc$prob23) %>% round(digits = 3)
## [1] 0.184
loss_func(class = alc$high_use, prob = alc$prob24) %>% round(digits = 3)
## [1] 0.181
loss_func(class = alc$high_use, prob = alc$prob25) %>% round(digits = 3)
## [1] 0.178
loss_func(class = alc$high_use, prob = alc$prob26) %>% round(digits = 3)
## [1] 0.186
loss_func(class = alc$high_use, prob = alc$prob27) %>% round(digits = 3)
## [1] 0.176
loss_func(class = alc$high_use, prob = alc$prob28) %>% round(digits = 3)
## [1] 0.181
loss_func(class = alc$high_use, prob = alc$prob29) %>% round(digits = 3)
## [1] 0.178
loss_func(class = alc$high_use, prob = alc$prob30) %>% round(digits = 3)
## [1] 0.184
loss_func(class = alc$high_use, prob = alc$prob31) %>% round(digits = 3)
## [1] 0.195
loss_func(class = alc$high_use, prob = alc$prob32) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob33) %>% round(digits = 3)
## [1] 0.203
loss_func(class = alc$high_use, prob = alc$prob34) %>% round(digits = 3)
## [1] 0.186
loss_func(class = alc$high_use, prob = alc$prob35) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob36) %>% round(digits = 3)
## [1] 0.181
loss_func(class = alc$high_use, prob = alc$prob37) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob38) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob39) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob40) %>% round(digits = 3)
## [1] 0.203
loss_func(class = alc$high_use, prob = alc$prob41) %>% round(digits = 3)
## [1] 0.197
loss_func(class = alc$high_use, prob = alc$prob42) %>% round(digits = 3)
## [1] 0.211
loss_func(class = alc$high_use, prob = alc$prob43) %>% round(digits = 3)
## [1] 0.216
loss_func(class = alc$high_use, prob = alc$prob44) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob45) %>% round(digits = 3)
## [1] 0.211
loss_func(class = alc$high_use, prob = alc$prob46) %>% round(digits = 3)
## [1] 0.214
loss_func(class = alc$high_use, prob = alc$prob47) %>% round(digits = 3)
## [1] 0.27
# K-fold cross-validation: proportion of wrong predictions for each model,
# again collapsed into a loop over the model numbers
library(boot)
for (i in 21:47) {
  assign(paste0("cv", i),
         cv.glm(data = alc, cost = loss_func,
                glmfit = get(paste0("m", i)), K = nrow(alc) / 10))
}

# proportion of wrong predictions in the cross-validation
cv21$delta[1] %>% round(digits = 3)
## [1] 0.246
cv22$delta[1] %>% round(digits = 3)
## [1] 0.246
cv23$delta[1] %>% round(digits = 3)
## [1] 0.243
cv24$delta[1] %>% round(digits = 3)
## [1] 0.224
cv25$delta[1] %>% round(digits = 3)
## [1] 0.243
cv26$delta[1] %>% round(digits = 3)
## [1] 0.227
cv27$delta[1] %>% round(digits = 3)
## [1] 0.23
cv28$delta[1] %>% round(digits = 3)
## [1] 0.241
cv29$delta[1] %>% round(digits = 3)
## [1] 0.235
cv30$delta[1] %>% round(digits = 3)
## [1] 0.235
cv31$delta[1] %>% round(digits = 3)
## [1] 0.23
cv32$delta[1] %>% round(digits = 3)
## [1] 0.224
cv33$delta[1] %>% round(digits = 3)
## [1] 0.232
cv34$delta[1] %>% round(digits = 3)
## [1] 0.214
cv35$delta[1] %>% round(digits = 3)
## [1] 0.219
cv36$delta[1] %>% round(digits = 3)
## [1] 0.216
cv37$delta[1] %>% round(digits = 3)
## [1] 0.222
cv38$delta[1] %>% round(digits = 3)
## [1] 0.219
cv39$delta[1] %>% round(digits = 3)
## [1] 0.216
cv40$delta[1] %>% round(digits = 3)
## [1] 0.219
cv41$delta[1] %>% round(digits = 3)
## [1] 0.211
cv42$delta[1] %>% round(digits = 3)
## [1] 0.222
cv43$delta[1] %>% round(digits = 3)
## [1] 0.224
cv44$delta[1] %>% round(digits = 3)
## [1] 0.205
cv45$delta[1] %>% round(digits = 3)
## [1] 0.219
cv46$delta[1] %>% round(digits = 3)
## [1] 0.23
cv47$delta[1] %>% round(digits = 3)
## [1] 0.27
# collect the training and testing errors into a data frame (note: no seed
# was set before cv.glm, so the testing errors recorded here come from a
# different run and differ slightly from the deltas printed above)
err <- data.frame(npredictor = c(1:27), 
       train = c(0.189, 0.186, 0.184, 0.181, 0.178, 0.186, 0.176, 0.181, 0.178, 0.184, 0.195, 0.200, 0.203, 0.186, 0.200, 0.181, 0.200, 0.200, 0.200, 0.203, 0.197, 0.211, 0.216, 0.200, 0.211, 0.214, 0.270), 
       test = c(0.259, 0.235, 0.249, 0.235, 0.241, 0.251, 0.232, 0.23, 0.227, 0.227, 0.232, 0.230, 0.227, 0.216, 0.219, 0.219, 0.227, 0.222, 0.222, 0.224, 0.211, 0.224, 0.230, 0.216, 0.219, 0.243, 0.270))

# change the format of the dataset from wide to long
# (pivot_longer comes from the tidyr package)
library(tidyr)
err_long <- err %>% pivot_longer(cols = c("train", "test"), names_to = "model", values_to = "ploss")

# print dimensions and column names
dim(err); colnames(err)
## [1] 27  3
## [1] "npredictor" "train"      "test"
dim(err_long); colnames(err_long)
## [1] 54  3
## [1] "npredictor" "model"      "ploss"
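The wide-to-long reshape is why the 27 rows become 54: every row of `err` contributes one "train" and one "test" row to `err_long`. A minimal self-contained illustration on toy data (my own example):

```r
# toy illustration of pivot_longer: every wide row yields one long row per
# stacked column
library(tidyr)
toy <- data.frame(npredictor = 1:2, train = c(0.1, 0.2), test = c(0.3, 0.4))
toy_long <- pivot_longer(toy, cols = c("train", "test"),
                         names_to = "model", values_to = "ploss")
dim(toy_long)
## [1] 4 3
```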
# access the ggplot2 library
library(ggplot2)

# Initialize plots with data and aesthetic mapping
ep1 <- ggplot(err_long, aes(x = npredictor, y = ploss, color = model))

# Define the visualization type (points) + add a second-degree polynomial
# smoother + add a main title and draw the plot
ep2 <- ep1 + geom_point() + geom_smooth(method = "glm", formula = y ~ poly(x, 2)) + ggtitle("Predictive loss in training and testing")
ep2

# new data for prediction
errpred <- data.frame(npredictor = c(1:27))

# fit second-degree polynomial models to the training and testing errors
# (note: lm(train ~ npredictor^2) would NOT fit a polynomial, because '^'
# has a special meaning inside R formulas, so poly() is used instead)
predm1 <- glm(train ~ poly(npredictor, 2), data = err)
predm2 <- glm(test ~ poly(npredictor, 2), data = err)


# predict the fitted values (note: predict.glm returns point predictions
# only; it has no 'interval' argument)
predict(predm1, newdata = errpred) %>% round(digits = 3)
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 0.187 0.186 0.185 0.184 0.183 0.183 0.183 0.183 0.184 0.184 0.185 0.186 0.188 
##    14    15    16    17    18    19    20    21    22    23    24    25    26 
## 0.189 0.191 0.193 0.196 0.198 0.201 0.204 0.207 0.211 0.215 0.219 0.223 0.227 
##    27 
## 0.232
predict(predm2, newdata = errpred) %>% round(digits = 3)
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 0.256 0.251 0.247 0.243 0.239 0.236 0.233 0.230 0.228 0.226 0.224 0.223 0.222 
##    14    15    16    17    18    19    20    21    22    23    24    25    26 
## 0.221 0.221 0.221 0.221 0.222 0.223 0.224 0.225 0.227 0.230 0.232 0.235 0.238 
##    27 
## 0.242
predict(predm1, newdata = errpred) %>% round(digits = 3) %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1830  0.1845  0.1890  0.1966  0.2055  0.2320
predict(predm2, newdata = errpred) %>% round(digits = 3) %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2210  0.2230  0.2280  0.2311  0.2370  0.2560
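The number of predictors that minimizes each fitted curve can also be read off programmatically with `which.min`. A self-contained sketch of the idea on toy data (my own illustration; the toy data are exactly quadratic with a minimum at x = 5):

```r
# fit a quadratic with glm + poly and locate the x that minimizes the
# fitted curve
toy <- data.frame(x = 1:9, y = (1:9 - 5)^2 + 3)
fit <- glm(y ~ poly(x, 2), data = toy)
unname(which.min(predict(fit, newdata = data.frame(x = 1:9))))
## [1] 5
```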
# summary values for dataset err
summary(err)
##    npredictor       train             test      
##  Min.   : 1.0   Min.   :0.1760   Min.   :0.211  
##  1st Qu.: 7.5   1st Qu.:0.1840   1st Qu.:0.222  
##  Median :14.0   Median :0.1970   Median :0.227  
##  Mean   :14.0   Mean   :0.1967   Mean   :0.231  
##  3rd Qu.:20.5   3rd Qu.:0.2015   3rd Qu.:0.235  
##  Max.   :27.0   Max.   :0.2700   Max.   :0.270

In order to compare different logistic regression models, I started from a model with 27 explanatory variables, i.e. nearly all variables available. I then built 27 models in total, reducing the model by one variable at a time, always deleting the variable with the highest p-value. In the model with seven explanatory variables (sex, address, paid, family relations, study time, going out and absences) all predictors are statistically significant (p < 0.05).

The training errors varied from 0.176 to 0.270 and the testing errors from 0.211 to 0.270. The plot shows that the relationship between the number of explanatory variables and the errors is not linear but closer to a second-degree polynomial. In training, the model with 7 explanatory variables has the smallest error (0.176); in testing, the minimum error (0.211) occurs with 21 explanatory variables. On the other hand, when the errors themselves are modeled, the smallest fitted error occurs with 5-8 explanatory variables in training (0.183) and with 14-17 explanatory variables in testing (0.221). All in all, the errors varied little, implying that the phenomenon (high use of alcohol) is indeed multifaceted by nature. In addition, the categorization of some variables does not seem statistically optimal, and some ordinal variables might be better modeled as categorical ones.


Clustering and classification

knitr::opts_chunk$set(message = FALSE)
date()
## [1] "Wed Nov 30 13:09:04 2022"
# LOADING THE DATA

# access the MASS package
library(MASS)

# load the data
data("Boston")

# define subset of data for boxplots
box <- dplyr::select(Boston, -tax, -black)

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# plot distributions and matrix of the variables
boxplot(box)

boxplot(Boston$tax)

boxplot(Boston$black)

# scatterplot matrix of the variables (pairs() draws the plot as a side
# effect and returns NULL, so its value is not stored)
pairs(Boston)

The Boston data contain information from 506 towns on factors such as the crime rate, residential land use, nitrogen oxides concentration, rooms per dwelling, tax rate, pupil-teacher ratio, lower status of the population and the value of owner-occupied homes. Altogether 14 variables are included. No values are missing.

Variable distributions are mostly skewed. The variable age, i.e. the proportion of owner-occupied units built prior to 1940, has a median of 77.5% (range 2.9% to 100%). Between towns, the crime rate varies from 0.006 to 89.0, the tax rate from 187 to 711 per 10,000 $, the pupil-teacher ratio from 12.6 to 22.0, the proportion of lower-status population from 1.7% to 38.0% and the nitrogen oxides concentration from 0.39 to 0.87 parts per 10 million.

# EXPLORING THE DATA

# access the tidyr and corrplot libraries
library(tidyr)
library(corrplot)

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix 
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", tl.pos = "d", tl.cex = 0.6)

The variables are mostly correlated with each other. The exception is the Charles River dummy variable (whether the tract bounds the river), which correlates only mildly with the median value of owner-occupied homes (rho = 0.18). Remarkably high correlations appear between age and both nitrogen oxides concentration (rho = 0.73) and mean distance to the main employment centres (rho = -0.75); between non-retail business acres and nitrogen oxides concentration (rho = 0.76), mean distance to the main employment centres (rho = -0.71) and tax rate (rho = 0.72); between the average number of rooms per dwelling and the median value of owner-occupied homes (rho = 0.70); between accessibility to radial highways and tax rate (rho = 0.91); and between the proportion of lower-status population and the median value of owner-occupied homes (rho = -0.74).
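The strongest pairs can also be pulled out programmatically from the upper triangle of the correlation matrix. A sketch that recomputes the matrix so the chunk stands alone (the threshold 0.70 is my own choice):

```r
# list variable pairs with |rho| >= 0.70 from the upper triangle of the
# correlation matrix
cm <- cor(MASS::Boston)
cm[lower.tri(cm, diag = TRUE)] <- NA  # keep each pair only once
idx <- which(abs(cm) >= 0.70, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           rho = round(cm[idx], 2))
```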

# SCALING THE DATASET AND CREATING 'CRIME' VARIABLE

# accessing library dplyr
library(dplyr)

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# plot distributions
boxplot(boston_scaled)

# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- scale(Boston) %>% as.data.frame

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)

# look at the table of the new factor crime
table(crime)
## crime
## [-0.419,-0.411]  (-0.411,-0.39] (-0.39,0.00739]  (0.00739,9.92] 
##             127             126             126             127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Scaling centers each variable to mean 0 and standard deviation 1, which puts the variables on a comparable scale; the shapes of the distributions themselves are unchanged.
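scale() simply applies (x - mean(x)) / sd(x) column by column, which can be verified directly (my own check on the crime rate column):

```r
# verify that scale() equals manual centering and standardization
manual <- (MASS::Boston$crim - mean(MASS::Boston$crim)) / sd(MASS::Boston$crim)
all.equal(as.numeric(scale(MASS::Boston$crim)), manual)
## [1] TRUE
```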

# LINEAR DISCRIMINANT ANALYSIS

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
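Note that no seed is set before sample(), so the split (and hence the prediction accuracy below) changes on every knit. A reproducible variant would look like this (the seed value 2022 is an arbitrary choice of mine):

```r
# reproducible 80/20 split: fix the RNG state before sampling
set.seed(2022)
ind2 <- sample(nrow(boston_scaled), size = floor(nrow(boston_scaled) * 0.8))
length(ind2)
## [1] 404
```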

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
## [-0.419,-0.411]  (-0.411,-0.39] (-0.39,0.00739]  (0.00739,9.92] 
##       0.2475248       0.2475248       0.2549505       0.2500000 
## 
## Group means:
##                          zn      indus         chas        nox         rm
## [-0.419,-0.411]  0.89018782 -0.9190750 -0.154216061 -0.8743449  0.4512779
## (-0.411,-0.39]  -0.06618671 -0.3373848  0.003267949 -0.5983556 -0.1160011
## (-0.39,0.00739] -0.38566716  0.1861715  0.224586496  0.3999370  0.0976330
## (0.00739,9.92]  -0.48724019  1.0171306 -0.116404306  1.0515355 -0.4378830
##                        age        dis        rad        tax     ptratio
## [-0.419,-0.411] -0.8829452  0.8296495 -0.6936060 -0.7156118 -0.42658514
## (-0.411,-0.39]  -0.3634925  0.4129459 -0.5362660 -0.4753680 -0.06860847
## (-0.39,0.00739]  0.4241615 -0.3693295 -0.4054077 -0.3020525 -0.29472237
## (0.00739,9.92]   0.8229709 -0.8493803  1.6379981  1.5139626  0.78062517
##                      black       lstat         medv
## [-0.419,-0.411]  0.3738794 -0.78741280  0.498873736
## (-0.411,-0.39]   0.3079896 -0.13388251  0.002195656
## (-0.39,0.00739]  0.1307855  0.06235575  0.152982883
## (0.00739,9.92]  -0.8517941  0.87419487 -0.632693170
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.06955390  0.71451785 -1.02516946
## indus    0.03847793 -0.37339824  0.34293147
## chas    -0.09189611 -0.11887772  0.07646332
## nox      0.42385230 -0.75275018 -1.28283102
## rm      -0.11164654 -0.14722340 -0.17374637
## age      0.28442603 -0.39368069 -0.06693408
## dis      0.02277236 -0.43683052  0.39566030
## rad      3.14817296  0.81955106  0.07341605
## tax      0.03329946  0.13300674  0.38016840
## ptratio  0.12307423  0.03153954 -0.30586427
## black   -0.13803955 -0.03609761  0.07546738
## lstat    0.23174743 -0.27561482  0.45717631
## medv     0.25307415 -0.42300323 -0.07479530
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9481 0.0398 0.0121
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

In the linear discriminant analysis, the highest crime class almost solely occupies one cluster, while the other crime classes are mixed in the other cluster. The most influential separators seem to be accessibility to radial highways, the proportion of residential land zoned for lots over 2323 m2 and nitrogen oxides concentration, most likely reflecting differences between rural and urban environments. The first discriminant function accounts for 94.8% of the between-group variance.
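The "proportion of trace" line can be reproduced from the singular values stored in the fitted object (assuming the lda.fit object from the chunk above): it equals the squared singular values normalized to sum to one.

```r
# proportion of trace = squared singular values, normalized to sum to one
round(lda.fit$svd^2 / sum(lda.fit$svd^2), digits = 4)
```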

# PREDICTING WITH THE LDA MODEL

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins
##                  predicted
## correct           [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
##   [-0.419,-0.411]              16              9               2              0
##   (-0.411,-0.39]                3             16               7              0
##   (-0.39,0.00739]               2              6              14              1
##   (0.00739,9.92]                0              0               0             26
##   Sum                          21             31              23             27
##                  predicted
## correct           Sum
##   [-0.419,-0.411]  27
##   (-0.411,-0.39]   26
##   (-0.39,0.00739]  23
##   (0.00739,9.92]   26
##   Sum             102

The LDA model works reasonably well: it correctly predicts the crime class of 72 of the 102 test towns (71%).
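The overall accuracy is just the sum of the diagonal of the cross tabulation divided by the table total, so it can be computed directly instead of counting by hand (assuming the correct_classes and lda.pred objects from the chunk above):

```r
# overall accuracy: correctly classified towns divided by all test towns
tab <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(tab)) / sum(tab)
```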

# DISTANCES AND CLUSTERING

# access library ggplot2
library(ggplot2)

# scale dataset
boston_scaled2 <- data.frame(scale(Boston))

# euclidean distance matrix
dist_eu <- dist(boston_scaled2)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
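The two metrics differ in how they accumulate coordinate differences: Euclidean distance takes the square root of summed squares, Manhattan distance sums absolute differences. On a classic 3-4-5 example (my own illustration):

```r
# euclidean vs manhattan distance between (0, 0) and (3, 4)
pts <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(pts))                        # sqrt(3^2 + 4^2) = 5
## [1] 5
as.numeric(dist(pts, method = "manhattan"))  # |3| + |4| = 7
## [1] 7
```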
# k-means clustering
km <- kmeans(boston_scaled2, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

pairs(boston_scaled2[1:7], col = km$cluster)

pairs(boston_scaled2[8:14], col = km$cluster)

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

pairs(boston_scaled2[1:7], col = km$cluster)

pairs(boston_scaled2[8:14], col = km$cluster)

Scaling puts the variables on a common scale, so no single variable dominates the distance computations. For k-means clustering, 2 centers seem to be the best choice. One cluster seems to be associated with a low crime rate, a low proportion of non-retail business acres, a low nitrogen oxides concentration, newer housing stock, a high mean distance to the main employment centres, low accessibility to radial highways, a low tax rate, a high value of the variable describing racial composition and a high median value of owner-occupied homes. This may indicate that the clustering separates rural, or otherwise less densely populated, areas from urban areas.

# BONUS SECTION: MORE ON K-MEANS CLUSTERING AND LINEAR DISCRIMINANT ANALYSIS

# k-means clustering
km2 <- kmeans(boston_scaled2, centers = 6)

# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km2$cluster)

pairs(boston_scaled2[1:7], col = km2$cluster)

pairs(boston_scaled2[8:14], col = km2$cluster)

# linear discriminant analysis
lda.fit2 <- lda(km2$cluster ~ ., data = boston_scaled2)

# print the lda.fit object
lda.fit2
## Call:
## lda(km2$cluster ~ ., data = boston_scaled2)
## 
## Prior probabilities of groups:
##          1          2          3          4          5          6 
## 0.06719368 0.12252964 0.24505929 0.18972332 0.29051383 0.08498024 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm        age
## 1 -0.1985497 -0.2602436  0.2799956  3.6647712  0.3830784  0.2756681  0.3721322
## 2 -0.4141953  2.3322773 -1.1788641 -0.2088275 -1.1887944  0.7188485 -1.4216721
## 3  1.1156495 -0.4872402  1.0149946 -0.2723291  0.9916473 -0.4276828  0.7515952
## 4 -0.3269211 -0.4816572  0.6406213 -0.2723291  0.4664828 -0.5082323  0.7668344
## 5 -0.3977957 -0.1563279 -0.5990216 -0.2723291 -0.6696819 -0.1686625 -0.6751063
## 6 -0.3732407 -0.1422288 -0.8310017 -0.2723291 -0.2005297  1.6901170  0.1841367
##          dis          rad        tax     ptratio       black      lstat
## 1 -0.4033382  0.001081444 -0.0975633 -0.39245849  0.17154271 -0.1643525
## 2  1.6223203 -0.666968948 -0.5535972 -0.82951564  0.35193833 -0.9665363
## 3 -0.8170870  1.659602895  1.5294129  0.80577843 -0.81154619  0.9129958
## 4 -0.5732656 -0.602637816 -0.1649468  0.21059409  0.04824716  0.5397714
## 5  0.5672190 -0.575610755 -0.6877857 -0.05330275  0.36267528 -0.3956902
## 6 -0.3232389 -0.511800977 -0.8155263 -1.10522083  0.34963036 -0.9616241
##           medv
## 1  0.573340910
## 2  0.799806470
## 3 -0.771340259
## 4 -0.505310108
## 5  0.007601824
## 6  1.719927960
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3          LD4         LD5
## crim     0.034508803  0.05797085 -0.16103219  0.152322830  0.01702637
## zn       0.263383770 -0.23309900 -1.54525759 -1.099430119  0.78886901
## indus   -0.073596625  0.35511069  0.27621202 -0.741302164 -0.03049072
## chas    -5.833329077 -0.28810784 -0.25366217 -0.153849858 -0.20101832
## nox      0.012019761 -0.33200265  0.08267993 -0.138256930  0.38821434
## rm       0.077523647  0.02092389 -0.05279584  0.346603115  0.54639625
## age     -0.078503080  0.10642631  0.50741330 -0.163867301  0.79431423
## dis     -0.055475885 -0.36793105 -0.33307268 -0.006313437 -0.37843684
## rad     -0.319867249  3.41804068 -1.06679562  1.629076215 -0.94341422
## tax      0.011360375 -0.30248228 -0.48448257 -0.924010759  0.59988910
## ptratio -0.010141142 -0.03367258  0.16740977 -0.504498820  0.12595753
## black   -0.007393166 -0.08886704  0.01924355  0.045161817 -0.07202500
## lstat    0.123902259  0.13351602  0.01047765  0.007966329  0.40940453
## medv     0.203273034 -0.38745314 -0.07192725  0.516887596  0.76079378
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4    LD5 
## 0.6292 0.2589 0.0730 0.0230 0.0159
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# plot the lda results
plot(lda.fit2, dimen = 2, col = km2$cluster, pch = km2$cluster)
lda.arrows(lda.fit2, myscale = 2)

In the linear discriminant plot, three groups stand out: one occupied by k-means cluster 1, another by k-means cluster 3, and a third containing the remaining k-means clusters. The most influential linear separators seem to be accessibility to radial highways (rad) and the Charles River dummy variable (chas). The first discriminant function accounts for 62.9% and the second for 25.9% of the between-group variance.
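The printed "Proportion of trace" can be recomputed from the singular values stored in the fitted object; a sketch, assuming the `lda.fit2` object from above exists:

```r
# lda.fit2$svd holds the singular values, i.e. the ratios of between- to
# within-group standard deviations along each discriminant; their squared
# shares give the proportion of trace
prop_trace <- lda.fit2$svd^2 / sum(lda.fit2$svd^2)
round(prop_trace, 4)  # should match the printed trace: 0.6292 0.2589 ...
```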

# SUPER-BONUS SECTION: 3D PLOTTING

# access library plotly
library(plotly)

# define a new object
model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

# k-means clustering
km3 <- kmeans(matrix_product, centers = 2)
km4 <- km3$cluster 

# Create 3D plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km4)

The first 3D plot shows two clusters that are well separated from each other. In the second plot, one of these clusters is mainly occupied by the highest crime class. In the third plot, the colouring coincides fully with the k-means clustering with two centers.
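The visual agreement between the clusters and the crime classes can be checked numerically with a cross-tabulation; a quick sketch, assuming the objects `km4` (cluster labels) and `train` from above are in memory:

```r
# rows: k-means cluster (2 centers), columns: crime rate class;
# a near block-diagonal pattern would confirm the visual impression
table(km4, train$crime)
```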


Dimensionality reduction techniques

knitr::opts_chunk$set(message = FALSE)
date()
## [1] "Wed Nov 30 13:09:49 2022"
# BRING IN THE DATA

# access packages readr and dplyr
library(readr); library(dplyr)

# Set the working directory
setwd("C:/Users/yona2/Desktop/FT/IODS2022/RStudio/IODS-project/data")

human <- read.csv("human.csv")

# add countries as rownames
rownames(human) <- human$X

# remove the Country variable
human <- select(human, -X)


# EXPLORING THE DATA

# Access GGally and corrplot
library(GGally)
library(corrplot)

# visualize the variables
p <- ggpairs(human, progress = FALSE) # for some reason ggpairs produces error in knitting...
p

# compute and print the correlation matrix
cor(human) %>% round(digits = 3)
##           edu2ratio jobratio explife expyrsed    GNI    MMR    ABR propparl
## edu2ratio     1.000    0.010   0.576    0.593  0.430 -0.661 -0.529    0.079
## jobratio      0.010    1.000  -0.140    0.047 -0.022  0.240  0.120    0.250
## explife       0.576   -0.140   1.000    0.789  0.627 -0.857 -0.729    0.170
## expyrsed      0.593    0.047   0.789    1.000  0.624 -0.736 -0.704    0.206
## GNI           0.430   -0.022   0.627    0.624  1.000 -0.495 -0.557    0.089
## MMR          -0.661    0.240  -0.857   -0.736 -0.495  1.000  0.759   -0.089
## ABR          -0.529    0.120  -0.729   -0.704 -0.557  0.759  1.000   -0.071
## propparl      0.079    0.250   0.170    0.206  0.089 -0.089 -0.071    1.000

Distributions of the variables are rather skewed and the scales vary greatly from one variable to another. The variables are fairly highly correlated with each other, except for the gender ratio of labour-force participation (jobratio) and parliamentary representation (propparl), which correlate least with the other factors.
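The same correlation structure can also be drawn with corrplot (attached above); a minimal sketch, assuming the `human` data frame is loaded:

```r
# visualize the correlation matrix; circle size/colour encode the
# strength and sign of each pairwise correlation
cor(human) %>% corrplot(method = "circle", type = "upper", tl.cex = 0.8)
```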

# PRINCIPAL COMPONENT ANALYSIS

# perform principal component analysis (with the SVD method)
pca <- prcomp(human)

# summary of pca
summary(pca)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca, choices = 1:2, col = c("grey40", "deeppink2"), cex = c(0.8, 1))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##    edu2ratio          jobratio          explife           expyrsed      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI               MMR               ABR             propparl      
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# compute and print the correlation matrix of the standardized variables
cor(human_std) %>% round(digits = 3)
##           edu2ratio jobratio explife expyrsed    GNI    MMR    ABR propparl
## edu2ratio     1.000    0.010   0.576    0.593  0.430 -0.661 -0.529    0.079
## jobratio      0.010    1.000  -0.140    0.047 -0.022  0.240  0.120    0.250
## explife       0.576   -0.140   1.000    0.789  0.627 -0.857 -0.729    0.170
## expyrsed      0.593    0.047   0.789    1.000  0.624 -0.736 -0.704    0.206
## GNI           0.430   -0.022   0.627    0.624  1.000 -0.495 -0.557    0.089
## MMR          -0.661    0.240  -0.857   -0.736 -0.495  1.000  0.759   -0.089
## ABR          -0.529    0.120  -0.729   -0.704 -0.557  0.759  1.000   -0.071
## propparl      0.079    0.250   0.170    0.206  0.089 -0.089 -0.071    1.000
# perform principal component analysis (with the SVD method)
pca_std <- prcomp(human_std)

# summary of pca
summary(pca_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# draw a biplot of the principal component representation and the original variables
biplot(pca_std, choices = 1:2, col = c("grey40", "deeppink2"), cex = c(0.8, 1))

As expected, scaling reduces the variances. Without scaling, PCA produces components the first of which seems to explain nearly all of the variation in the dataset and is dominated by GNI, which has by far the largest variance among the variables. After scaling, the first four components cumulatively explain 87% of the variation in the data. The arrows nicely display the observed correlations between the variables.
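The dominance of GNI in the unscaled PCA can be seen directly from the raw variances; a sketch, assuming `human` is still in memory:

```r
# variances of the unscaled variables: GNI should be orders of magnitude
# larger than the rest, which is why it swallows the first component
sapply(human, var) %>% round(1)
```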

# create and print out a summary of pca_std
s <- summary(pca_std)

# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# print out the full pca_std object (standard deviations and rotations)
pca_std
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## edu2ratio -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## jobratio   0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## explife   -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## expyrsed  -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## MMR        0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## ABR        0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## propparl  -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## edu2ratio  0.17713316  0.05773644  0.16459453
## jobratio  -0.03500707 -0.22729927 -0.07304568
## explife   -0.42242796 -0.43406432  0.62737008
## expyrsed  -0.38606919  0.77962966 -0.05415984
## GNI        0.11120305 -0.13711838 -0.16961173
## MMR        0.17370039  0.35380306  0.72193946
## ABR       -0.76056557 -0.06897064 -0.14335186
## propparl   0.13749772  0.00568387 -0.02306476
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# preview the axis labels
paste0(names(pca_pr), " (", pca_pr, "%)")
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)"  "PC4 (7.6%)"  "PC5 (5.5%)" 
## [6] "PC6 (3.6%)"  "PC7 (2.6%)"  "PC8 (1.3%)"
# create objects pc_lab and pc_lab2 to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
lbl <- c("Underdevelopment level:", "Gender equality:", "", "", "", "", "", "")
pc_lab2 <- paste0(lbl, " ", pc_lab)

# draw a biplot again
biplot(pca_std, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab2[1], ylab = pc_lab2[2])

PC1 seems to be associated with the development level of a country: the variables loading most heavily on it are the maternal mortality ratio, the adolescent birth rate, life expectancy and expected years of education. PC2, on the other hand, seems to reflect gender equality, being driven mainly by the gender ratio of labour-force participation and parliamentary representation. Interestingly, the gender ratio of secondary education seems to be more related to the development level of a country than to the gender equality reflected in the labour market and governance.

# MULTIPLE CORRESPONDENCE ANALYSIS

# accessing libraries dplyr, tidyr and ggplot2
library(dplyr)
library(tidyr)
library(ggplot2)

# load tea data
tea_time <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea_time.csv", 
                       sep = ",", header = T)

# create a numerical version of tea data
tea_timen <- tea_time
tea_timen$Tea[tea_timen$Tea == "black"] <- 1
tea_timen$Tea[tea_timen$Tea == "Earl Grey"] <- 2
tea_timen$Tea[tea_timen$Tea == "green"] <- 3
tea_timen$How[tea_timen$How == "alone"] <- 1
tea_timen$How[tea_timen$How == "lemon"] <- 2
tea_timen$How[tea_timen$How == "milk"] <- 3
tea_timen$How[tea_timen$How == "other"] <- 4
tea_timen$how[tea_timen$how == "tea bag"] <- 1
tea_timen$how[tea_timen$how == "tea bag+unpackaged"] <- 2
tea_timen$how[tea_timen$how == "unpackaged"] <- 3
tea_timen$sugar[tea_timen$sugar == "sugar"] <- 1
tea_timen$sugar[tea_timen$sugar == "No.sugar"] <- 2
tea_timen$where[tea_timen$where == "chain store"] <- 1
tea_timen$where[tea_timen$where == "chain store+tea shop"] <- 2
tea_timen$where[tea_timen$where == "tea shop"] <- 3
tea_timen$lunch[tea_timen$lunch == "lunch"] <- 1
tea_timen$lunch[tea_timen$lunch == "Not.lunch"] <- 2

tea_timen <- as.data.frame(apply(tea_timen, 2, as.numeric))
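The manual recoding above can be written more compactly with factors; a sketch, noting that factor()'s default sorted level order happens to match the manual coding for most variables but not for sugar (and the sort is locale-dependent for lunch), so those two are set explicitly:

```r
# compact alternative to the manual numeric recoding (a sketch)
tea_timen2 <- tea_time
tea_timen2$sugar <- factor(tea_timen2$sugar, levels = c("sugar", "No.sugar"))
tea_timen2$lunch <- factor(tea_timen2$lunch, levels = c("lunch", "Not.lunch"))

# convert every column to its (1-based) level code
tea_timen2 <- as.data.frame(lapply(tea_timen2, function(x) as.numeric(factor(x))))
```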

# look at the summaries and structure of the data
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : chr  "black" "black" "Earl Grey" "Earl Grey" ...
##  $ How  : chr  "alone" "milk" "alone" "alone" ...
##  $ how  : chr  "tea bag" "tea bag" "tea bag" "tea bag" ...
##  $ sugar: chr  "sugar" "No.sugar" "No.sugar" "sugar" ...
##  $ where: chr  "chain store" "chain store" "chain store" "chain store" ...
##  $ lunch: chr  "Not.lunch" "Not.lunch" "Not.lunch" "Not.lunch" ...
summary(tea_time)
##      Tea                How                how               sugar          
##  Length:300         Length:300         Length:300         Length:300        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##     where              lunch          
##  Length:300         Length:300        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
str(tea_timen)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : num  1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : num  1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : num  1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: num  1 2 2 1 2 2 2 2 2 2 ...
##  $ where: num  1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: num  2 2 2 2 2 2 2 2 2 2 ...
summary(tea_timen)
##       Tea             How            how            sugar           where     
##  Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.00  
##  1st Qu.:2.000   1st Qu.:1.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.00  
##  Median :2.000   Median :1.00   Median :1.000   Median :2.000   Median :1.00  
##  Mean   :1.863   Mean   :1.62   Mean   :1.553   Mean   :1.517   Mean   :1.46  
##  3rd Qu.:2.000   3rd Qu.:2.00   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.00  
##  Max.   :3.000   Max.   :4.00   Max.   :3.000   Max.   :2.000   Max.   :3.00  
##      lunch      
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :2.000  
##  Mean   :1.853  
##  3rd Qu.:2.000  
##  Max.   :2.000
# visualize the dataset
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

p <- ggpairs(tea_timen, progress = FALSE)
p

# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time[, 1:5], graph = FALSE)

# summary of the models
summary(mca)
## 
## Call:
## MCA(X = tea_time[, 1:5], graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.335   0.309   0.257   0.224   0.198   0.184   0.172
## % of var.             16.762  15.441  12.866  11.181   9.924   9.188   8.625
## Cumulative % of var.  16.762  32.204  45.069  56.250  66.174  75.362  83.987
##                        Dim.8   Dim.9  Dim.10
## Variance               0.141   0.105   0.075
## % of var.              7.031   5.249   3.734
## Cumulative % of var.  91.017  96.266 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.325  0.105  0.088 | -0.288  0.090  0.069 |  0.305
## 2                  | -0.259  0.067  0.037 | -0.074  0.006  0.003 |  0.789
## 3                  | -0.403  0.162  0.242 | -0.283  0.086  0.119 |  0.217
## 4                  | -0.580  0.334  0.481 | -0.328  0.116  0.154 | -0.290
## 5                  | -0.403  0.162  0.242 | -0.283  0.086  0.119 |  0.217
## 6                  | -0.403  0.162  0.242 | -0.283  0.086  0.119 |  0.217
## 7                  | -0.403  0.162  0.242 | -0.283  0.086  0.119 |  0.217
## 8                  | -0.259  0.067  0.037 | -0.074  0.006  0.003 |  0.789
## 9                  |  0.155  0.024  0.012 |  0.974  1.025  0.461 | -0.005
## 10                 |  0.521  0.270  0.142 |  0.845  0.770  0.373 |  0.612
##                       ctr   cos2  
## 1                   0.120  0.078 |
## 2                   0.806  0.343 |
## 3                   0.061  0.070 |
## 4                   0.109  0.120 |
## 5                   0.061  0.070 |
## 6                   0.061  0.070 |
## 7                   0.061  0.070 |
## 8                   0.806  0.343 |
## 9                   0.000  0.000 |
## 10                  0.485  0.196 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.294   0.073   4.681 |   0.198   0.627   0.013
## Earl Grey          |  -0.265   2.689   0.126  -6.147 |   0.088   0.319   0.014
## green              |   0.487   1.558   0.029   2.962 |  -0.956   6.515   0.113
## alone              |  -0.017   0.011   0.001  -0.405 |  -0.251   2.643   0.117
## lemon              |   0.668   2.926   0.055   4.059 |   0.458   1.497   0.026
## milk               |  -0.338   1.428   0.030  -3.010 |   0.220   0.659   0.013
## other              |   0.287   0.148   0.003   0.873 |   2.207   9.465   0.151
## tea bag            |  -0.607  12.466   0.482 -12.008 |  -0.336   4.134   0.147
## tea bag+unpackaged |   0.348   2.261   0.055   4.063 |   1.000  20.281   0.456
## unpackaged         |   1.959  27.484   0.524  12.511 |  -1.025   8.173   0.143
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                1.961 |   1.045  20.945   0.358  10.342 |
## Earl Grey            2.033 |  -0.462  10.689   0.386 -10.737 |
## green               -5.813 |   0.360   1.110   0.016   2.190 |
## alone               -5.905 |   0.150   1.136   0.042   3.534 |
## lemon                2.787 |  -1.561  20.842   0.301  -9.491 |
## milk                 1.963 |   0.093   0.141   0.002   0.828 |
## other                6.712 |   1.826   7.773   0.103   5.552 |
## tea bag             -6.637 |   0.069   0.211   0.006   1.367 |
## tea bag+unpackaged  11.677 |  -0.050   0.061   0.001  -0.586 |
## unpackaged          -6.548 |  -0.196   0.357   0.005  -1.249 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.115 0.421 |
## How                | 0.076 0.220 0.385 |
## how                | 0.708 0.503 0.008 |
## sugar              | 0.065 0.004 0.412 |
## where              | 0.701 0.702 0.061 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

The tea_time variables are rather weakly associated with each other. The only exception is the association between how the tea is purchased (as tea bags or unpackaged) and where it is bought (from a chain store or a tea shop). In the MCA factor map, the categories stay mostly close to the origin, implying that not much distinction can be made based on these variables. Still, the map hints that some drinkers prefer green tea bought unpackaged from a tea shop, some take Earl Grey (bought as tea bags from chain stores) with sugar and milk, and some like their tea black with lemon and no sugar. The categories 'other', 'chain store+tea shop' and 'tea bag+unpackaged' stand apart from the rest but are rather hard to interpret because of their mixed nature.